Optimizers (SGD, Adam, …)
An optimizer implements the rule that turns gradients into parameter updates each step. Vanilla SGD subtracts η times the gradient. Momentum accumulates a velocity vector that smooths noisy gradients and keeps progress moving through shallow, ill-conditioned ravines. Adaptive methods (RMSprop, Adam) scale each parameter's update using recent squared-gradient statistics, so effective step sizes differ per weight. In practice AdamW (Adam with decoupled weight decay) is a strong default for many vision and language fine-tuning setups; SGD + momentum with a tuned learning rate still competes on large-batch ImageNet-style training.
SGD and Momentum
Stochastic gradient descent: θ ← θ − η∇L. Mini-batches approximate the full-dataset gradient cheaply. Without momentum, updates zig-zag in ill-conditioned valleys.
Momentum maintains v ← βv + ∇L and updates θ ← θ − ηv. Nesterov momentum evaluates the gradient at a "look-ahead" point, often improving responsiveness. These methods use one global learning rate (per parameter group) and the same update scale for all coordinates, so tuning η and batch size matters a lot.
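As a minimal sketch of one momentum step (plain Python; `params`, `grads`, and `velocities` are illustrative lists of floats standing in for tensors, not part of any library):
def sgd_momentum_step(params, grads, velocities, lr=0.1, beta=0.9):
    # One update over a flat list of scalar parameters.
    for i, g in enumerate(grads):
        velocities[i] = beta * velocities[i] + g    # v <- beta*v + grad
        params[i] = params[i] - lr * velocities[i]  # theta <- theta - lr*v
    return params, velocities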
RMSprop and Adam
RMSprop keeps a moving average of squared gradients and divides the update by its root mean square, giving per-parameter scaling. Adam combines momentum (first moment) with RMSprop-like variance normalization (second moment), with bias correction for early steps. Default hyperparameters (β₁=0.9, β₂=0.999, ε small) work surprisingly often out of the box.
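A rough sketch of the Adam update for a single scalar parameter (plain Python with illustrative names; real implementations operate on whole tensors):
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count; m and v carry state between steps.
    m = beta1 * m + (1 - beta1) * grad        # first moment: momentum on the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v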
AdamW fixes how weight decay interacts with Adam's adaptive scaling: the decay term is applied directly to the weights rather than being mixed into the gradient that Adam rescales. This decoupled form matches modern transformer and vision training recipes better than the older "Adam + L2" formulation.
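A sketch of where the decay term enters in each case, for a single parameter (`adam_dir` is a stand-in callable for Adam's bias-corrected, variance-scaled direction, not a real API):
def adam_plus_l2_step(theta, grad, adam_dir, lr, wd):
    # Coupled: the L2 term is folded into the gradient, so Adam's per-parameter
    # rescaling distorts the decay as well.
    return theta - lr * adam_dir(grad + wd * theta)

def adamw_step(theta, grad, adam_dir, lr, wd):
    # Decoupled: the decay hits the weight directly, outside the adaptive scaling.
    return theta - lr * adam_dir(grad) - lr * wd * theta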
Practical Notes
- Use parameter groups for different learning rates (e.g. backbone vs head in transfer learning); see the sketch after this list.
- Gradient clipping caps the gradient norm just before the optimizer step, which stabilizes RNN and large language model training.
- Optimizer choice pairs with the learning rate schedule (warmup, cosine decay); see the next tutorial in the series.
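A sketch of the first two points in PyTorch (assuming `model` exposes `backbone` and `head` submodules; the attribute names are illustrative):
import torch
import torch.optim as optim

# Per-group learning rates: small for a pretrained backbone, larger for a fresh head.
opt = optim.AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
], weight_decay=0.01)

# Gradient clipping: cap the global grad norm between loss.backward() and opt.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)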
PyTorch: AdamW and SGD
import torch.optim as optim
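# `model` is assumed to be a torch.nn.Module built earlier in the tutorial.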
# Default for many deep models
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
# ImageNet-style (example hyperparameters — tune for your setup)
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)
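Either optimizer is then driven by the usual training step (a sketch; `inputs`, `targets`, and `loss_fn` are assumed to come from your data pipeline and task):
opt_adamw.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
opt_adamw.step()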
Summary
- SGD is simple; momentum helps in ill-conditioned, ravine-like loss landscapes; Adam/AdamW adapt per-parameter step sizes.
- AdamW is preferred when using meaningful weight decay with Adam.
- Tune learning rate (and schedule) with your optimizer and batch size together.
- Next: learning rate schedules and warmup.
Continue to Learning rate schedules—how η changes during training often matters as much as the optimizer family.