Optimizers: Driving Neural Network Training
Optimizers implement the parameter update rules that minimize the loss function. From vanilla SGD to adaptive methods like Adam and modern breakthroughs like Lion — complete mathematical and practical reference.
- **SGD**: Vanilla, Momentum, NAG
- **Adaptive**: AdaGrad, RMSprop
- **Adam**: AdamW, Nadam, AMSGrad
- **Modern**: Lion, Adafactor
What is an Optimizer?
Optimizers are algorithms that update model parameters (weights) to minimize the loss function. They determine how to move in the gradient direction — how fast, with what momentum, and with what adaptive scaling. The choice of optimizer critically affects training speed, stability, and final performance.
Optimizers incorporate gradient history, adaptive learning rates, and momentum.
Gradient Descent Variants
Batch GD
Uses entire dataset to compute gradient. θ = θ - lr · ∇L(θ; all data)
Stable, but slow; not feasible for large datasets.
Stochastic GD (SGD)
θ = θ - lr · ∇L(θ; xᵢ, yᵢ)
Updates on a single sample at a time. High variance, but enables online learning.
Mini-batch GD
θ = θ - lr · ∇L(θ; batch)
A balance of both; the most common choice, with batch sizes of 32-512.
import numpy as np

def sgd_update(params, grads, lr=0.01):
    """Simple SGD update: param <- param - lr * grad."""
    for param, grad in zip(params, grads):
        param -= lr * grad  # in-place update on NumPy arrays
    return params
Momentum & Nesterov Accelerated Gradient
SGD with Momentum
vₜ = βvₜ₋₁ + (1-β)∇L(θₜ)
θₜ₊₁ = θₜ - lr · vₜ
Accumulates velocity to overcome ravines and accelerate convergence. β typically 0.9.
def momentum_update(params, grads, v, lr=0.01, beta=0.9):
    for i, (p, g) in enumerate(zip(params, grads)):
        v[i] = beta * v[i] + (1 - beta) * g  # accumulate velocity
        p -= lr * v[i]
    return params, v
Nesterov Accelerated Gradient (NAG)
vₜ = βvₜ₋₁ + (1-β)∇L(θₜ - lr·βvₜ₋₁)
θₜ₊₁ = θₜ - lr · vₜ
Looks ahead at the approximate future position. Often faster and more stable than standard momentum.
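The look-ahead step can be sketched in NumPy. This is a conceptual sketch of the update above, not a library implementation; `grad_fn` is an illustrative callable that returns the gradient at a given point:

```python
import numpy as np

def nag_step(param, grad_fn, v, lr=0.1, beta=0.9):
    # Evaluate the gradient at the look-ahead point theta - lr*beta*v,
    # not at the current parameters
    g = grad_fn(param - lr * beta * v)
    v = beta * v + (1 - beta) * g
    param = param - lr * v
    return param, v
```

On a simple quadratic (`grad_fn = lambda x: 2 * x`), repeated `nag_step` calls drive the parameter toward the minimum at zero.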
Adaptive Learning Rate Methods
AdaGrad
Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ - lr/√(Gₜ + ε) · ∇L(θₜ)
Adapts per-parameter learning rates. Good for sparse data. Learning rate decays monotonically.
Weakness: the accumulated sum Gₜ grows without bound, so the effective LR shrinks toward zero and training eventually stalls.
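The accumulation rule above can be sketched in the same style as the other updates here (a conceptual sketch, mirroring the SGD helper; not a library implementation):

```python
import numpy as np

def adagrad_update(params, grads, G, lr=0.01, eps=1e-8):
    for i, (p, g) in enumerate(zip(params, grads)):
        G[i] += g**2                       # lifetime sum of squared gradients
        p -= lr * g / np.sqrt(G[i] + eps)  # per-parameter effective LR shrinks
    return params, G
```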
RMSprop
E[g²]ₜ = βE[g²]ₜ₋₁ + (1-β)(∇L)²
θₜ₊₁ = θₜ - lr/√(E[g²]ₜ + ε) · ∇L
Proposed by Hinton in an unpublished lecture, but widely used. Fixes AdaGrad's decaying-LR problem by replacing the lifetime sum with an exponential moving average. β typically 0.9.
def rmsprop_update(params, grads, cache, lr=0.001, beta=0.9, eps=1e-8):
    for i, (p, g) in enumerate(zip(params, grads)):
        cache[i] = beta * cache[i] + (1 - beta) * g**2  # EMA of squared grads
        p -= lr * g / (np.sqrt(cache[i]) + eps)
    return params, cache
Adam & The Adaptive Moment Family
Adam (Adaptive Moment Estimation)
mₜ = β₁mₜ₋₁ + (1-β₁)∇L
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L)²
θₜ₊₁ = θₜ - lr · m̂ₜ/(√v̂ₜ + ε)
Combines momentum (first moment) and RMSprop-style scaling (second moment), with bias-corrected estimates. Typical defaults: β₁=0.9, β₂=0.999, ε=1e-8 (Keras uses 1e-7).
Default optimizer for most tasks
AdamW
θₜ₊₁ = θₜ - lr · (m̂ₜ/(√v̂ₜ+ε) + λθₜ)
Decoupled weight decay. Improves generalization over Adam. Recommended over Adam.
# PyTorch: torch.optim.AdamW
# TensorFlow: tf.keras.optimizers.AdamW
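The "decoupled" part can be made concrete with a small NumPy sketch: the decay term is applied directly to the parameter rather than folded into the gradient. This is conceptual, not the library implementation:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad          # first moment
    v = b2 * v + (1 - b2) * grad**2       # second moment
    m_hat = m / (1 - b1**t)               # bias correction, t starts at 1
    v_hat = v / (1 - b2**t)
    # Weight decay acts on the parameter directly, outside the
    # adaptive scaling -- the decoupling that distinguishes AdamW from
    # L2 regularization folded into the gradient
    param -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * param)
    return param, m, v
```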
Nadam
Adam + Nesterov momentum. Slightly faster convergence.
AMSGrad
Variant that uses maximum of past squared gradients. Addresses convergence issues.
AdaBelief
Stepsize scaled by belief in observed gradient. More stable.
# Simplified Adam update (conceptual)
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2       # second moment (scaling)
    m_hat = m / (1 - b1**t)               # bias correction, t starts at 1
    v_hat = v / (1 - b2**t)
    param -= lr * m_hat / (np.sqrt(v_hat) + 1e-7)
    return param, m, v
Modern & Emerging Optimizers
Lion (EvoLved Sign Momentum)
θₜ₊₁ = θₜ - lr · sign(β₁mₜ₋₁ + (1-β₁)∇L)
mₜ = β₂mₜ₋₁ + (1-β₂)∇L
Discovered by symbolic search. More memory-efficient than Adam. Used in Google's latest models.
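A minimal sketch of one Lion step, following the paper's formulation (β₁ interpolates for the update direction, β₂ updates the stored momentum); the decoupled weight decay `wd` is optional:

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    # Update direction: only the sign of the interpolated momentum/gradient,
    # so every coordinate moves by exactly lr (plus decay)
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    param -= lr * (update + wd * param)
    # The stored momentum is refreshed with a *different* coefficient
    m = beta2 * m + (1 - beta2) * grad
    return param, m
```

Only one moment vector is stored per parameter (versus two for Adam), which is where the memory saving comes from.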
Adafactor
Memory-efficient Adam for large models. Factorizes second moment estimates. Used in T5.
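The core factorization trick can be sketched for a 2-D weight matrix: instead of storing a full second-moment matrix, keep one running vector per row and per column and reconstruct a rank-1 approximation. A conceptual sketch, not the full Adafactor algorithm:

```python
import numpy as np

def factored_second_moment(grad, R, C, beta2=0.999, eps=1e-30):
    """Track the second moment of an (n, m) gradient with an n-vector
    and an m-vector instead of a full n*m matrix."""
    sq = grad**2 + eps
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)  # row statistics, shape (n,)
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)  # column statistics, shape (m,)
    V = np.outer(R, C) / R.sum()                  # rank-1 reconstruction, (n, m)
    return V, R, C
```

Memory drops from O(n·m) to O(n+m) per weight matrix, which is what makes billion-parameter training feasible.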
LAMB & LARS
LARS (Layer-wise Adaptive Rate Scaling) and LAMB (its Adam-based successor) normalize the update magnitude per layer. Designed for large-batch training (ResNet, BERT on TPUs).
Learning Rate Scheduling
Even with adaptive optimizers, scheduling the learning rate improves convergence.
Step Decay
Drop LR by factor every few epochs.
# TF: tf.keras.optimizers.schedules.ExponentialDecay
# PyTorch: torch.optim.lr_scheduler.StepLR
Cosine Annealing
Smooth cyclic decay. Often with warm restarts.
tf.keras.optimizers.schedules.CosineDecay
Warmup
Linear increase from 0 to initial LR. Stabilizes large model training.
ReduceLROnPlateau
Reduce LR when validation loss plateaus.
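The warmup-plus-cosine combination described above can be sketched as a plain schedule function (an illustrative helper, not a library API):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr):
    # Linear warmup from 0 to base_lr, then cosine decay back to 0
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```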
Optimizer Selection Guide
| Optimizer | Adaptive | Momentum | When to use | Memory |
|---|---|---|---|---|
| SGD | ❌ | ❌ | Simple models, baselines | Low |
| SGD+Momentum | ❌ | ✅ | Classic CNNs, needs LR tuning | Low |
| RMSprop | ✅ | ❌ | RNNs, online learning | Medium |
| Adam | ✅ | ✅ | Default for most tasks | Medium |
| AdamW | ✅ | ✅ | Transformers, NLP, better generalization | Medium |
| Nadam | ✅ | ✅ (Nesterov) | Slightly faster Adam | Medium |
| Lion | ❌ (sign-based) | ✅ | Memory efficient, vision tasks | Low |
| Adafactor | ✅ | Optional (off by default) | Giant models (LLMs) | Very low |
Quick Selection Rules:
- Start with AdamW – works well out-of-the-box.
- For NLP / Transformers: AdamW with cosine decay + warmup.
- For Computer Vision: SGD with momentum can outperform Adam (requires tuning).
- For large models (>1B params): Adafactor or Lion to save memory.
- For sparse data: AdaGrad or Adam.
Optimizers in TensorFlow & PyTorch
TensorFlow / Keras
import tensorflow as tf
# Common optimizers
model.compile(optimizer='sgd', ...)
model.compile(optimizer='adam', ...)
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4))
# Learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(1e-3, decay_steps=10000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)
# Custom training step
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
import torch.optim as optim
# Optimizers
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# Learning rate schedulers
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
# Training loop
for epoch in range(epochs):
    for x, y in dataloader:
        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # per-epoch step; ReduceLROnPlateau needs scheduler.step(val_loss)
Optimizer Hyperparameter Tuning
LR Range Test: Increase LR exponentially each batch, plot loss. Optimal LR is just before loss explodes.
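The sweep itself is a short loop. A sketch of the idea; `train_step` is a hypothetical callable that runs one mini-batch update at the given LR and returns the loss (real tools exist, e.g. fastai's `lr_find`):

```python
def lr_range_test(train_step, lr_min=1e-7, lr_max=10.0, num_steps=100):
    """Sweep the LR exponentially, one mini-batch per value,
    recording (lr, loss) pairs for plotting."""
    factor = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    lr, history = lr_min, []
    for _ in range(num_steps):
        loss = train_step(lr)   # one mini-batch update at this LR
        history.append((lr, loss))
        lr *= factor
    return history
```

Plot loss against LR on a log axis and pick a value just below where the loss starts to blow up.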