Computer Vision Chapter 30

AlexNet

AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) won the ImageNet ILSVRC 2012 classification challenge by a wide margin (roughly 15.3% top-5 error versus 26.2% for the runner-up) and is often credited with sparking the modern deep-learning wave in vision. It stacks conv–ReLU–pool blocks, applies dropout in its heavy fully connected layers, and was trained on two GPUs with a split architecture. Today it is a historical baseline (VGG, ResNet, and EfficientNet surpass it), but it remains the canonical teaching example. Below: an architecture sketch, torchvision loading, ImageNet-style preprocessing, logits, the 4096-D features before the classifier, and top-k decoding, with multiple code blocks.

Architecture (conceptual)

Input is traditionally 224×224 (after cropping). Five convolutional layers with ReLU, with max pooling after the first, second, and fifth (the original paper used overlapping 3×3, stride-2 pooling). Three large fully connected layers (4096, 4096, 1000), with dropout on the first two. Local Response Normalization (LRN) appeared in the original paper; torchvision's implementation follows the simplified single-GPU variant and omits LRN, so check the version you use.
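
For reference, here is a minimal sketch of the layer stack as torchvision defines it (single-GPU variant; channel counts follow torchvision, not the original two-GPU split, and this is illustrative rather than a drop-in replacement):

import torch.nn as nn

# Sketch of torchvision-style AlexNet: features -> avgpool (AdaptiveAvgPool2d((6, 6))) -> classifier.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
classifier = nn.Sequential(
    nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)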

ReLU

Faster training than saturating tanh/sigmoid for deep nets at the time.

Dropout

Regularizes the huge fully connected layers (most of AlexNet's parameters) to reduce co-adaptation of features.
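
A small illustration of the mechanism (generic nn.Dropout, not AlexNet-specific): with p=0.5, the rate AlexNet uses in its FC layers, dropout zeroes roughly half the activations during training and is a no-op at eval time.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # same rate as AlexNet's FC layers
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # about half the entries zeroed, survivors scaled by 1 / (1 - p) = 2
drop.eval()
print(drop(x))   # identity at inference time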

Load pretrained weights

import torch
from torchvision.models import alexnet, AlexNet_Weights

weights = AlexNet_Weights.IMAGENET1K_V1
model = alexnet(weights=weights).eval()

preprocess = weights.transforms()
print(preprocess)

Random init (train from scratch)

model_scratch = alexnet(weights=None)

Single image → class logits

from PIL import Image

img = Image.open("cat.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top5 = probs.topk(5, dim=1)

# Map indices to labels
categories = weights.meta["categories"]
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{categories[idx]}: {float(score):.4f}")

4096-D embedding (before classifier)

# torchvision alexnet: features → avgpool → classifier (fc layers)
with torch.no_grad():
    x = model.features(batch)
    x = model.avgpool(x)
    x = torch.flatten(x, 1)
    # Default torchvision alexnet: classifier[6] is Linear(4096, 1000)
    vec = model.classifier[:6](x)   # through second FC + ReLU → 4096-D
print(vec.shape)

Confirm the indices with print(model.classifier); the slice changes if the head has been replaced for fine-tuning.

Alternative: hook after second ReLU

# Cache the 4096-D activation via a forward hook on classifier[5]
# (the ReLU after the second fully connected layer).
activation = {}
def get(name):
    def hook(m, i, o):
        activation[name] = o.detach()
    return hook

h = model.classifier[5].register_forward_hook(get("fc4096_relu"))
_ = model(batch)
h.remove()
feat = activation["fc4096_relu"]
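
As a quick consistency check (assuming model, batch, and vec from the blocks above, and an unmodified head), the hooked activation should match the sliced computation:

print(feat.shape)                 # torch.Size([1, 4096])
print(torch.allclose(feat, vec))  # True when the head is unmodified and the model is in eval mode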

Mini-batch

from PIL import Image

paths = ["a.jpg", "b.jpg", "c.jpg"]
tensors = [preprocess(Image.open(p).convert("RGB")) for p in paths]
xb = torch.stack(tensors, dim=0)

with torch.no_grad():
    out = model(xb)
print(out.shape)  # [3, 1000]

weights.transforms() handles resizing, center-cropping, tensor conversion, and ImageNet normalization for PIL or tensor inputs; the exact pipeline depends on your torchvision version.
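
If you need the steps spelled out (for example, to replicate them elsewhere), a manual pipeline roughly equivalent to the IMAGENET1K_V1 transforms looks like this; the mean/std are the standard ImageNet statistics:

from torchvision import transforms

manual_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])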

Fine-tune last layer (sketch)

import torch.nn as nn

num_classes = 10
model_ft = alexnet(weights=weights)
model_ft.classifier[6] = nn.Linear(4096, num_classes)
# freeze earlier layers optionally, then train with your dataloader
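
One common recipe, sketched under the assumption that only the new head trains at first: freeze the convolutional body and optimize just the classifier parameters.

# Freeze the convolutional body; only the (replaced) classifier head will update.
for p in model_ft.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model_ft.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9,
)
# Training loop (not shown): forward, cross-entropy loss, backward, optimizer.step().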

Takeaways

  • AlexNet = deep conv stacks + large FC + ReLU/dropout—ImageNet 2012 breakthrough.
  • Use AlexNet_Weights transforms for correct normalization.
  • For transfer learning, replace the final Linear(4096, 1000) with your class count.

Quick FAQ

Should you still use AlexNet for new work?
Rarely for accuracy; the ResNet, EfficientNet, and ViT families dominate. AlexNet remains useful for teaching and as a lightweight baseline on small data with heavy regularization.

Can you feed a different input resolution?
In recent torchvision versions AlexNet uses adaptive average pooling, so the flattened size the FC layers receive stays fixed, but the pretrained weights were trained at 224×224: very small inputs fail in the conv stack, and other resolutions can degrade accuracy. Keep 224, or redesign and retrain the head.
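
A quick shape check under that assumption (default torchvision model; accuracy at non-native resolutions is not guaranteed):

with torch.no_grad():
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
    print(model(torch.randn(1, 3, 320, 320)).shape)  # also [1, 1000] thanks to adaptive average pooling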