Convolutional Neural Networks: The Vision Architecture
CNNs revolutionized computer vision by learning hierarchical feature representations. From edge detection to semantic understanding — complete guide covering convolution math, layer design, modern architectures, and implementation.
- Conv2D: kernels, stride, padding
- Pooling: downsampling
- Residual: skip connections
- EfficientNet: compound scaling
What is a Convolutional Neural Network?
A Convolutional Neural Network (CNN) is a specialized neural architecture designed for processing grid-structured data such as images. Instead of fully connected layers, CNNs use convolutional layers that learn spatial hierarchies of patterns — from edges and textures to object parts and complete objects.
CNNs automatically learn spatial feature hierarchies via backpropagation.
Convolution Arithmetic: Kernels, Stride, Padding
2D Convolution
(I * K)(i,j) = Σₘ Σₙ I(i+m, j+n) · K(m,n)
Input I, kernel/filter K, padding P, stride S. Output size: H_out = ⌊(H − K_h + 2P)/S⌋ + 1 and W_out = ⌊(W − K_w + 2P)/S⌋ + 1.
Kernel: typically 3x3 or 5x5. Stride (S): 1 or 2. Padding (P): 'same' (preserves spatial size) or 'valid' (no padding).
Receptive Field
Each neuron in deeper layers sees a larger region of the input. Stacking 3x3 convs: 3 layers → receptive field 7x7.
RFₖ = RFₖ₋₁ + (fₖ − 1) · ∏_{i=1}^{k−1} sᵢ   (fₖ: kernel size of layer k, sᵢ: stride of layer i, RF₁ = f₁)
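The recursion above can be sketched in a few lines of Python (`receptive_field` is a hypothetical helper taking (kernel, stride) pairs in layer order):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    rf = 1      # a single input pixel
    jump = 1    # product of strides of the layers seen so far
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three stacked 3x3 convs with stride 1 -> 7x7 receptive field
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

This reproduces the example above: three 3x3 layers give the same receptive field as a single 7x7 kernel, with fewer parameters.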
```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # image: (H, W), kernel: (kH, kW)
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH + 2 * padding) // stride + 1
    out_w = (W - kW + 2 * padding) // stride + 1
    if padding > 0:
        image = np.pad(image, pad_width=padding, mode='constant')
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            h_start = i * stride
            w_start = j * stride
            output[i, j] = np.sum(image[h_start:h_start+kH, w_start:w_start+kW] * kernel)
    return output
```
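As a sanity check, the valid (no-padding) case can also be written in vectorized form with NumPy's `sliding_window_view`, here applied with a Sobel kernel to a toy step-edge image (`conv2d_vec` is an illustrative name, not a library function):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def conv2d_vec(image, kernel, stride=1):
    # Vectorized 'valid' cross-correlation: every kH x kW window times kernel
    windows = sliding_window_view(image, kernel.shape)  # (out_h, out_w, kH, kW)
    return np.einsum('ijkl,kl->ij', windows[::stride, ::stride], kernel)

# Toy image with a vertical edge: left half 0, right half 1
image = np.zeros((6, 6))
image[:, 3:] = 1.0

sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])

edges = conv2d_vec(image, sobel_x)
print(edges.shape)  # (4, 4), matching (6 - 3)//1 + 1 in each dimension
```

The response is strongest exactly where the windows straddle the intensity step, which is the edge-detection behavior learned filters in layer 1 often resemble.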
Pooling and Activation Functions
Max Pooling
Selects maximum value in window. Reduces spatial size, provides translation invariance.
2x2 window, stride 2 → halves height and width (4x fewer activations)
Average Pooling
Averages the values in the window. Smoother, but preserves edges less sharply. Global average pooling is standard in modern architectures, replacing large fully connected layers.
ReLU
max(0, x). Standard activation. Variants: LeakyReLU, ELU, GELU.
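Max pooling itself takes only a few lines of NumPy; a minimal sketch (`max_pool2d` is a hypothetical helper, not a library call):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    # Max pooling over an (H, W) array; default 2x2 window, stride 2
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.]
                      #  [13. 15.]]
```

Each output value keeps only the strongest activation in its window, which is what gives max pooling its small degree of translation invariance.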
Feature Hierarchy: From Edges to Objects
Layer 1: Low-level features
Gabor-like filters: edges, corners, color blobs. Generally 3x3 or 5x5 kernels.
Layer 2: Mid-level features
Textures, patterns, part of objects (eyes, wheels).
Layer 3-4: High-level features
Object parts, semantic concepts (faces, animals, cars).
Fully Connected
Global reasoning, classification scores.
Feature visualization: Early layers → edges / Later layers → entire objects. Learned via backpropagation.
Iconic CNN Architectures (1998–2019)
LeNet-5 (1998)
The original CNN. 2 conv + pooling layers, 3 FC layers. Handwritten digit recognition.
AlexNet (2012)
Breakthrough. ReLU, Dropout, GPU training. 5 conv + 3 FC.
VGGNet (2014)
Simplicity: 3x3 conv, deeper (16-19 layers).
ResNet (2015)
Skip connections → train up to 152 layers. Residual learning. Solves degradation.
Output = F(x) + x
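The residual identity Output = F(x) + x translates directly into a basic PyTorch block. This is a simplified sketch (fixed channel count, no downsampling path):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Basic block: two 3x3 convs with batch norm, identity skip,
    # ReLU applied after the addition
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # F(x) + x: the skip connection

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

Because gradients flow through the identity path unimpeded, stacking many such blocks avoids the degradation problem that plagued very deep plain networks.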
DenseNet (2017)
Concatenate all previous feature maps. Feature reuse.
EfficientNet (2019)
Compound scaling: depth, width, resolution. Neural Architecture Search.
Inception / GoogLeNet
Parallel conv 1x1, 3x3, 5x5, pooling → concatenate. Efficient.
MobileNet
Depthwise separable convolution. Lightweight for edge.
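MobileNet's depthwise separable convolution splits a standard conv into a per-channel spatial conv (groups=in_ch) followed by a 1x1 pointwise conv that mixes channels. A PyTorch sketch (`depthwise_separable` is an illustrative helper):

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, kernel=3):
    # Depthwise: one spatial filter per input channel (groups=in_ch)
    # Pointwise: 1x1 conv to mix channels
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch),
        nn.ReLU(),
        nn.Conv2d(in_ch, out_ch, 1),
        nn.ReLU(),
    )

layer = depthwise_separable(32, 64)
x = torch.randn(1, 32, 16, 16)
print(layer(x).shape)  # torch.Size([1, 64, 16, 16])

# Compare parameter counts with a standard 3x3 conv of the same shape
standard = nn.Conv2d(32, 64, 3, padding=1)
dw_params = sum(p.numel() for p in layer.parameters())
std_params = sum(p.numel() for p in standard.parameters())
print(dw_params, std_params)  # depthwise-separable uses far fewer weights
```

The parameter and FLOP savings are what make this the building block of choice for edge and mobile deployment.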
Transfer Learning with CNNs
In practice, CNNs are rarely trained from scratch; instead, start from models pretrained on ImageNet (1.2M images).
Feature Extractor
Freeze conv base, train new classifier on top. Fast, small dataset.
Fine-tuning
Unfreeze some layers, train with lower LR. Adapt to domain.
```python
import torch
import torchvision.models as models

# Load pretrained ResNet
resnet = models.resnet50(weights='IMAGENET1K_V2')

# Freeze parameters
for param in resnet.parameters():
    param.requires_grad = False

# Replace classifier head
num_classes = 10  # e.g., 10 target classes
num_features = resnet.fc.in_features
resnet.fc = torch.nn.Linear(num_features, num_classes)

# Train only the new head
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=0.001)
```
CNN in TensorFlow & PyTorch
TensorFlow / Keras
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPool2D(2, 2),
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPool2D(2, 2),
    tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
    tf.keras.layers.GlobalAvgPool2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
```
PyTorch
```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

model = SimpleCNN()
```
CNN Training: Hyperparameters & Best Practices
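A minimal training loop shows the usual ingredients (loss, optimizer, and the zero_grad/backward/step cycle). The tiny model and the synthetic batch below are placeholders for a real architecture and dataset; the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(  # tiny CNN stand-in; substitute your real model
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic batch: 8 RGB 32x32 images with random labels
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))

model.train()
for epoch in range(5):
    optimizer.zero_grad()          # clear gradients from the last step
    loss = criterion(model(x), y)  # forward pass + loss
    loss.backward()                # backpropagation
    optimizer.step()               # weight update
print(f'final loss: {loss.item():.3f}')
```

In practice this loop is wrapped around a `DataLoader`, with data augmentation, a learning-rate schedule, and validation after each epoch.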
CNNs Beyond Image Classification
Object Detection
YOLO, Faster R-CNN, SSD. CNNs for localization + classification.
Semantic Segmentation
U-Net, DeepLab, FCN. Pixel-wise classification.
Video Analysis
3D CNNs, I3D, C3D. Spatiotemporal filters.
Generative Models
DCGAN, StyleGAN. CNN generators/discriminators.
Medical Imaging
MRI, CT, histopathology. Pretrained CNNs as feature extractors.
Self-driving
Lane detection, obstacle recognition.
CNN Architectures & Use Cases – Cheatsheet
Architecture Comparison
| Architecture | Year | Depth | Key Innovation | Top-1 ImageNet |
|---|---|---|---|---|
| AlexNet | 2012 | 8 | ReLU, Dropout, GPU | ~57% |
| VGG-16 | 2014 | 16 | 3x3 only | ~71% |
| ResNet-50 | 2015 | 50 | Skip connections | ~76% |
| DenseNet-121 | 2017 | 121 | Dense blocks | ~75% |
| EfficientNet-B7 | 2019 | ~80 | Compound scaling, NAS | ~84% |