Computer Vision Chapter 30

AlexNet

AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) won the ImageNet ILSVRC 2012 classification challenge by a wide margin (roughly 15.3% top-5 error versus 26.2% for the runner-up) and is often credited with sparking the modern deep-learning wave in vision. It stacks conv–ReLU–pool blocks, applies dropout in its heavy fully connected layers, and was trained on two GPUs with a split architecture. Today it is a historical baseline (VGG, ResNet, and EfficientNet surpass it), but it remains the canonical teaching example. Below: an architecture sketch, torchvision loading, ImageNet-style preprocessing, logits, the 4096-D features before the classifier, and top-k decoding, with multiple code blocks.

Architecture (conceptual)

Input is traditionally 224×224 (after cropping). Five convolutional layers with ReLU, with max pooling after the first, second, and fifth (the original paper used overlapping 3×3, stride-2 pooling). Three large fully connected layers (4096, 4096, 1000), with dropout on the first two. Local Response Normalization (LRN) appeared in the original paper; torchvision's implementation follows the simplified single-GPU variant and omits LRN, so check the version you use.
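
For reference, here is a minimal sketch of the layer stack as torchvision defines it (single-GPU variant; channel counts follow torchvision, not the original two-GPU split, and this is illustrative rather than a drop-in replacement):

import torch.nn as nn

# Sketch of torchvision-style AlexNet: features -> avgpool (AdaptiveAvgPool2d((6, 6))) -> classifier.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
classifier = nn.Sequential(
    nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)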

ReLU

Faster training than saturating tanh/sigmoid for deep nets at the time.

Dropout

Regularizes the huge fully connected layers (most of AlexNet's parameters) to reduce co-adaptation of features.
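
A small illustration of the mechanism (generic nn.Dropout, not AlexNet-specific): with p=0.5, the rate AlexNet uses in its FC layers, dropout zeroes roughly half the activations during training and is a no-op at eval time.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # same rate as AlexNet's FC layers
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # about half the entries zeroed, survivors scaled by 1 / (1 - p) = 2
drop.eval()
print(drop(x))   # identity at inference time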

Load pretrained weights

import torch
from torchvision.models import alexnet, AlexNet_Weights

weights = AlexNet_Weights.IMAGENET1K_V1
model = alexnet(weights=weights).eval()

preprocess = weights.transforms()
print(preprocess)

Random init (train from scratch)

model_scratch = alexnet(weights=None)

Single image → class logits

from PIL import Image

img = Image.open("cat.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top5 = probs.topk(5, dim=1)

# Map indices to labels
categories = weights.meta["categories"]
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{categories[idx]}: {float(score):.4f}")

4096-D embedding (before classifier)

# torchvision alexnet: features → avgpool → classifier (fc layers)
with torch.no_grad():
    x = model.features(batch)
    x = model.avgpool(x)
    x = torch.flatten(x, 1)
    # Default torchvision alexnet: classifier[6] is Linear(4096, 1000)
    vec = model.classifier[:6](x)   # through second FC + ReLU → 4096-D
print(vec.shape)

Confirm the indices with print(model.classifier); the slice changes if the head has been replaced for fine-tuning.

Alternative: hook after second ReLU

# Cache the 4096-D activation via a forward hook on classifier[5]
# (the ReLU after the second fully connected layer).
activation = {}
def get(name):
    def hook(m, i, o):
        activation[name] = o.detach()
    return hook

h = model.classifier[5].register_forward_hook(get("fc4096_relu"))
_ = model(batch)
h.remove()
feat = activation["fc4096_relu"]
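
As a quick consistency check (assuming model, batch, and vec from the blocks above, and an unmodified head), the hooked activation should match the sliced computation:

print(feat.shape)                 # torch.Size([1, 4096])
print(torch.allclose(feat, vec))  # True when the head is unmodified and the model is in eval mode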

Mini-batch

from PIL import Image

paths = ["a.jpg", "b.jpg", "c.jpg"]
tensors = [preprocess(Image.open(p).convert("RGB")) for p in paths]
xb = torch.stack(tensors, dim=0)

with torch.no_grad():
    out = model(xb)
print(out.shape)  # [3, 1000]

weights.transforms() handles resizing, center-cropping, tensor conversion, and ImageNet normalization for PIL or tensor inputs; the exact pipeline depends on your torchvision version.
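
If you need the steps spelled out (for example, to replicate them elsewhere), a manual pipeline roughly equivalent to the IMAGENET1K_V1 transforms looks like this; the mean/std are the standard ImageNet statistics:

from torchvision import transforms

manual_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])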

Fine-tune last layer (sketch)

import torch.nn as nn

num_classes = 10
model_ft = alexnet(weights=weights)
model_ft.classifier[6] = nn.Linear(4096, num_classes)
# freeze earlier layers optionally, then train with your dataloader
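
One common recipe, sketched under the assumption that only the new head trains at first: freeze the convolutional body and optimize just the classifier parameters.

# Freeze the convolutional body; only the (replaced) classifier head will update.
for p in model_ft.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model_ft.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9,
)
# Training loop (not shown): forward, cross-entropy loss, backward, optimizer.step().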

Takeaways

  • AlexNet = deep conv stacks + large FC + ReLU/dropout—ImageNet 2012 breakthrough.
  • Use AlexNet_Weights transforms for correct normalization.
  • For transfer learning, replace the final Linear(4096, 1000) with your class count.

Quick FAQ

Should you still use AlexNet for new work?
Rarely for accuracy; the ResNet, EfficientNet, and ViT families dominate. AlexNet remains useful for teaching and as a lightweight baseline on small data with heavy regularization.

Can you feed a different input resolution?
In recent torchvision versions AlexNet uses adaptive average pooling, so the flattened size the FC layers receive stays fixed, but the pretrained weights were trained at 224×224: very small inputs fail in the conv stack, and other resolutions can degrade accuracy. Keep 224, or redesign and retrain the head.
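
A quick shape check under that assumption (default torchvision model; accuracy at non-native resolutions is not guaranteed):

with torch.no_grad():
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
    print(model(torch.randn(1, 3, 320, 320)).shape)  # also [1, 1000] thanks to adaptive average pooling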