Architecture (conceptual)
Input traditionally 224×224 (after crop). Five convolutional stages with ReLU and max pooling (the original paper used overlapping pooling in places). Three large fully connected layers (4096 → 4096 → 1000) with dropout. Local Response Normalization (LRN) appeared in the original paper; torchvision's implementation follows the simplified single-tower variant and omits LRN, so check the exact layer layout of the version you use.
ReLU
Faster training than saturating tanh/sigmoid for deep nets at the time.
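The saturation argument can be seen directly from gradients; a small illustration comparing ReLU with sigmoid at a large input:

```python
import torch

# ReLU passes gradient 1 for any positive input; sigmoid saturates at large |x|
x = torch.tensor([5.0], requires_grad=True)
torch.relu(x).backward()
relu_grad = x.grad.item()   # 1.0: no attenuation

y = torch.tensor([5.0], requires_grad=True)
torch.sigmoid(y).backward()
sig_grad = y.grad.item()    # ~0.0066: near-zero gradient slows learning

print(relu_grad, sig_grad)
```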
Dropout
Regularizes the huge FC parameters to reduce co-adaptation.
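Dropout's train/eval asymmetry is easy to demonstrate in isolation; a minimal sketch with p=0.5, the value used in the original paper's FC layers:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity at inference time, so no rescaling is needed
```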
Load pretrained weights
import torch
from torchvision.models import alexnet, AlexNet_Weights
weights = AlexNet_Weights.IMAGENET1K_V1
model = alexnet(weights=weights).eval()
preprocess = weights.transforms()
print(preprocess)
Random init (train from scratch)
model_scratch = alexnet(weights=None)
Single image → class logits
from PIL import Image
img = Image.open("cat.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)
with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top5 = probs.topk(5, dim=1)
# Map indices to labels
categories = weights.meta["categories"]
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{categories[idx]}: {float(score):.4f}")
4096-D embedding (before classifier)
# torchvision alexnet: features → avgpool → classifier (fc layers)
with torch.no_grad():
    x = model.features(batch)
    x = model.avgpool(x)
    x = torch.flatten(x, 1)
    # Default torchvision alexnet: classifier[6] is Linear(4096, 1000)
    vec = model.classifier[:6](x)  # through second FC + ReLU → 4096-D
print(vec.shape)  # torch.Size([1, 4096])
Confirm with print(model.classifier)—slices change if the head was replaced for fine-tuning.
Alternative: hook after second ReLU
activation = {}
def get(name):
    def hook(m, i, o):
        activation[name] = o.detach()
    return hook
h = model.classifier[5].register_forward_hook(get("fc4096_relu"))
_ = model(batch)
h.remove()
feat = activation["fc4096_relu"]
Mini-batch
from PIL import Image
paths = ["a.jpg", "b.jpg", "c.jpg"]
tensors = [preprocess(Image.open(p).convert("RGB")) for p in paths]
xb = torch.stack(tensors, dim=0)
with torch.no_grad():
out = model(xb)
print(out.shape) # [3, 1000]
weights.transforms() bundles the whole preprocessing pipeline (resize, center crop, conversion to tensor, ImageNet normalization) and accepts PIL images or tensors; the exact steps depend on your torchvision version.
Fine-tune last layer (sketch)
import torch.nn as nn
num_classes = 10
model_ft = alexnet(weights=weights)
model_ft.classifier[6] = nn.Linear(4096, num_classes)
# freeze earlier layers optionally, then train with your dataloader
Takeaways
- AlexNet = deep conv stacks + large FC layers + ReLU/dropout; the ImageNet 2012 breakthrough.
- Use AlexNet_Weights' transforms() for correct resizing and normalization.
- For transfer learning, replace the final Linear(4096, 1000) with your class count.
Quick FAQ
Q: Can I feed a different input resolution?
A: torchvision's AlexNet routes features through an adaptive avgpool before the classifier, so moderately different resolutions can work, but sizes far from 224 may still break the conv stack or hurt accuracy. Keep 224 or redesign the head.