
Optimizers (SGD, Adam, …)

An optimizer implements the rule that turns gradients into parameter updates each step. Vanilla SGD subtracts η times the gradient. Momentum accumulates a velocity vector that damps oscillations and speeds progress along shallow ravines. Adaptive methods (RMSprop, Adam) scale each parameter’s update using recent squared-gradient statistics, so effective step sizes differ per weight. In practice AdamW (Adam with decoupled weight decay) is a strong default for many vision and language fine-tuning setups; SGD + momentum with a tuned learning rate still competes on large-batch ImageNet-style training.


SGD and Momentum

Stochastic gradient descent: θ ← θ − η ∇L. Mini-batches approximate the full-dataset gradient cheaply. Without momentum, updates zig-zag in ill-conditioned valleys.
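
A minimal sketch of that update, written out by hand (the tensor names and the toy quadratic loss are illustrative, not from any particular codebase):

import torch

# One plain SGD step: theta <- theta - eta * grad
theta = torch.randn(10, requires_grad=True)
eta = 0.1

loss = (theta ** 2).sum()        # stand-in for a mini-batch loss
loss.backward()                  # fills theta.grad

with torch.no_grad():
    theta -= eta * theta.grad    # the SGD update
theta.grad.zero_()               # clear before the next mini-batch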

Momentum maintains v ← βv + ∇L and updates θ ← θ − ηv. Nesterov momentum evaluates the gradient at a “look-ahead” point, often improving responsiveness. These methods use one global learning rate (per parameter group) and the same update scale for all coordinates—tuning η and batch size matters a lot.
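
The momentum recurrence, sketched the same way (again with an illustrative toy loss):

import torch

# Momentum: v <- beta * v + grad;  theta <- theta - eta * v
theta = torch.randn(10, requires_grad=True)
v = torch.zeros_like(theta)
eta, beta = 0.01, 0.9

for _ in range(100):
    loss = (theta ** 2).sum()      # stand-in for a mini-batch loss
    loss.backward()
    with torch.no_grad():
        v = beta * v + theta.grad  # velocity accumulates recent gradients
        theta -= eta * v
    theta.grad.zero_()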

RMSprop and Adam

RMSprop keeps a moving average of squared gradients and divides the update by its root mean square, giving per-parameter scaling. Adam combines momentum (first moment) with RMSprop-like variance normalization (second moment), with bias correction for early steps. Default hyperparameters (β₁=0.9, β₂=0.999, ε small) work surprisingly often out of the box.
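
Written out, the Adam update described above looks like this (a hand-rolled sketch for intuition; in practice you would use torch.optim.Adam):

import torch

# Adam: first moment m, second moment s, both bias-corrected before the step
theta = torch.randn(10, requires_grad=True)
m = torch.zeros_like(theta)
s = torch.zeros_like(theta)
eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 101):
    loss = (theta ** 2).sum()                        # stand-in for a mini-batch loss
    loss.backward()
    with torch.no_grad():
        g = theta.grad
        m = beta1 * m + (1 - beta1) * g              # momentum (first moment)
        s = beta2 * s + (1 - beta2) * g * g          # squared-gradient average (second moment)
        m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
        s_hat = s / (1 - beta2 ** t)
        theta -= eta * m_hat / (s_hat.sqrt() + eps)  # per-parameter effective step size
    theta.grad.zero_()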

AdamW fixes how weight decay interacts with Adam’s adaptive scaling: the decay is applied directly to the weights (decoupled) rather than added to the gradient that Adam then rescales. This matches modern transformer and vision training recipes better than the older “Adam + L2” formulation.
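
In PyTorch the distinction is mostly which class you pick; the decay value below is illustrative, and the toy model exists only so the snippet runs on its own:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(16, 4)   # toy model for illustration

# Coupled "Adam + L2": weight_decay is folded into the gradient, which Adam
# then rescales, so the effective decay differs per weight.
opt_adam_l2 = optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW (decoupled): the decay shrinks the weights directly each step,
# independent of the adaptive per-parameter scaling.
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)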

Generalization. Adaptive methods sometimes settle in sharper minima than carefully tuned SGD; if your train–validation gap looks unexpectedly large, try SGD + momentum with a cosine schedule as an ablation, not because Adam is “wrong,” but because its interaction with the loss landscape differs.
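
A sketch of what that ablation might look like (hyperparameters are illustrative; schedules get their own treatment in the next tutorial):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(16, 4)   # toy model standing in for yours
epochs = 90

# Same model, but SGD + momentum with a cosine schedule instead of AdamW
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)

for epoch in range(epochs):
    # ... train one epoch with `opt` as usual ...
    sched.step()           # anneal the learning rate once per epoch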

Practical Notes

  • Use parameter groups to set different learning rates for different parts of the model (e.g. backbone vs head in transfer learning).
  • Gradient clipping caps the gradient norm before the optimizer step, which stabilizes RNNs and large language models; both points are sketched in the snippet after this list.
  • Optimizer choice pairs with learning rate schedules (warmup, cosine decay)—see the next tutorial in the series.
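
A short sketch covering the first two points (the backbone/head split, the learning rates, and the clipping threshold are all illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Toy two-part model standing in for a pretrained backbone plus a new head
model = nn.ModuleDict({
    "backbone": nn.Linear(32, 16),
    "head": nn.Linear(16, 4),
})

# Parameter groups: smaller learning rate for the backbone, larger for the head
opt = optim.AdamW([
    {"params": model["backbone"].parameters(), "lr": 1e-5},
    {"params": model["head"].parameters(), "lr": 1e-3},
], weight_decay=0.01)

x, y = torch.randn(8, 32), torch.randn(8, 4)       # stand-in mini-batch
loss = F.mse_loss(model["head"](model["backbone"](x)), y)
loss.backward()

# Clip the global gradient norm before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()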

PyTorch: AdamW and SGD

Typical setup
import torch.optim as optim

# Default for many deep models
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# ImageNet-style (example hyperparameters — tune for your setup)
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)
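
Either optimizer is then driven by the same step pattern; here is a self-contained sketch with a toy model and random data standing in for your real training loop:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

model = nn.Linear(16, 4)   # toy model for illustration
opt = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

for step in range(100):
    x, y = torch.randn(8, 16), torch.randn(8, 4)   # stand-in mini-batch
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    opt.step()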

Summary

  • SGD is simple; momentum helps with ill-conditioned curvature; Adam/AdamW adapt per-parameter step sizes.
  • AdamW is preferred when using meaningful weight decay with Adam.
  • Tune learning rate (and schedule) with your optimizer and batch size together.
  • Next: learning rate schedules and warmup.

Continue to Learning rate schedules—how η changes during training often matters as much as the optimizer family.