
Optimizers (SGD, Adam, …)

An optimizer implements the rule that turns gradients into parameter updates each step. Vanilla SGD subtracts η times the gradient. Momentum accumulates a velocity vector that damps oscillations and speeds progress along shallow ravines. Adaptive methods (RMSprop, Adam) scale each parameter’s update using recent squared-gradient statistics, so effective step sizes differ per weight. In practice AdamW (Adam with decoupled weight decay) is a strong default for many vision and language fine-tuning setups; SGD + momentum with a tuned learning rate still competes on large-batch ImageNet-style training.


SGD and Momentum

Stochastic gradient descent: θ ← θ − η ∇L. Mini-batches approximate the full-dataset gradient cheaply. Without momentum, updates zig-zag in ill-conditioned valleys.
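
A minimal sketch of that update, written out by hand (the tensor names and the toy quadratic loss are illustrative, not from any particular codebase):

import torch

# One plain SGD step: theta <- theta - eta * grad
theta = torch.randn(10, requires_grad=True)
eta = 0.1

loss = (theta ** 2).sum()        # stand-in for a mini-batch loss
loss.backward()                  # fills theta.grad

with torch.no_grad():
    theta -= eta * theta.grad    # the SGD update
theta.grad.zero_()               # clear before the next mini-batch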

Momentum maintains v ← βv + ∇L and updates θ ← θ − ηv. Nesterov momentum evaluates the gradient at a “look-ahead” point, often improving responsiveness. These methods use one global learning rate (per parameter group) and the same update scale for all coordinates—tuning η and batch size matters a lot.
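
The momentum recurrence, sketched the same way (again with an illustrative toy loss):

import torch

# Momentum: v <- beta * v + grad;  theta <- theta - eta * v
theta = torch.randn(10, requires_grad=True)
v = torch.zeros_like(theta)
eta, beta = 0.01, 0.9

for _ in range(100):
    loss = (theta ** 2).sum()      # stand-in for a mini-batch loss
    loss.backward()
    with torch.no_grad():
        v = beta * v + theta.grad  # velocity accumulates recent gradients
        theta -= eta * v
    theta.grad.zero_()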

RMSprop and Adam

RMSprop keeps a moving average of squared gradients and divides the update by its root mean square, giving per-parameter scaling. Adam combines momentum (first moment) with RMSprop-like variance normalization (second moment), with bias correction for early steps. Default hyperparameters (β₁=0.9, β₂=0.999, ε small) work surprisingly often out of the box.
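
Written out, the Adam update described above looks like this (a hand-rolled sketch for intuition; in practice you would use torch.optim.Adam):

import torch

# Adam: first moment m, second moment s, both bias-corrected before the step
theta = torch.randn(10, requires_grad=True)
m = torch.zeros_like(theta)
s = torch.zeros_like(theta)
eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 101):
    loss = (theta ** 2).sum()                        # stand-in for a mini-batch loss
    loss.backward()
    with torch.no_grad():
        g = theta.grad
        m = beta1 * m + (1 - beta1) * g              # momentum (first moment)
        s = beta2 * s + (1 - beta2) * g * g          # squared-gradient average (second moment)
        m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
        s_hat = s / (1 - beta2 ** t)
        theta -= eta * m_hat / (s_hat.sqrt() + eps)  # per-parameter effective step size
    theta.grad.zero_()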

AdamW fixes how weight decay interacts with Adam’s adaptive scaling: the decay is applied directly to the weights (decoupled) rather than added to the gradient that Adam then rescales. This matches modern transformer and vision training recipes better than the older “Adam + L2” formulation.
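
In PyTorch the distinction is mostly which class you pick; the decay value below is illustrative, and the toy model exists only so the snippet runs on its own:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(16, 4)   # toy model for illustration

# Coupled "Adam + L2": weight_decay is folded into the gradient, which Adam
# then rescales, so the effective decay differs per weight.
opt_adam_l2 = optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW (decoupled): the decay shrinks the weights directly each step,
# independent of the adaptive per-parameter scaling.
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)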

Generalization. Adaptive methods sometimes settle in sharper minima than carefully tuned SGD; if your train–validation gap looks unexpectedly large, try SGD + momentum with a cosine schedule as an ablation, not because Adam is “wrong,” but because its interaction with the loss landscape differs.
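
A sketch of what that ablation might look like (hyperparameters are illustrative; schedules get their own treatment in the next tutorial):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(16, 4)   # toy model standing in for yours
epochs = 90

# Same model, but SGD + momentum with a cosine schedule instead of AdamW
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)

for epoch in range(epochs):
    # ... train one epoch with `opt` as usual ...
    sched.step()           # anneal the learning rate once per epoch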

Practical Notes

  • Use parameter groups to set different learning rates for different parts of the model (e.g. backbone vs head in transfer learning).
  • Gradient clipping caps the gradient norm before the optimizer step, which stabilizes RNNs and large language models; both points are sketched in the snippet after this list.
  • Optimizer choice pairs with learning rate schedules (warmup, cosine decay)—see the next tutorial in the series.
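
A short sketch covering the first two points (the backbone/head split, the learning rates, and the clipping threshold are all illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Toy two-part model standing in for a pretrained backbone plus a new head
model = nn.ModuleDict({
    "backbone": nn.Linear(32, 16),
    "head": nn.Linear(16, 4),
})

# Parameter groups: smaller learning rate for the backbone, larger for the head
opt = optim.AdamW([
    {"params": model["backbone"].parameters(), "lr": 1e-5},
    {"params": model["head"].parameters(), "lr": 1e-3},
], weight_decay=0.01)

x, y = torch.randn(8, 32), torch.randn(8, 4)       # stand-in mini-batch
loss = F.mse_loss(model["head"](model["backbone"](x)), y)
loss.backward()

# Clip the global gradient norm before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()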

PyTorch: AdamW and SGD

Typical setup
import torch.optim as optim

# Default for many deep models
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# ImageNet-style (example hyperparameters — tune for your setup)
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)
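
Either optimizer is then driven by the same step pattern; here is a self-contained sketch with a toy model and random data standing in for your real training loop:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

model = nn.Linear(16, 4)   # toy model for illustration
opt = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

for step in range(100):
    x, y = torch.randn(8, 16), torch.randn(8, 4)   # stand-in mini-batch
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    opt.step()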

Summary

  • SGD is simple; momentum helps with ill-conditioned curvature; Adam/AdamW adapt per-parameter step sizes.
  • AdamW is preferred when using meaningful weight decay with Adam.
  • Tune learning rate (and schedule) with your optimizer and batch size together.
  • Next: learning rate schedules and warmup.

Continue to Learning rate schedules—how η changes during training often matters as much as the optimizer family.