Deep Learning Optimizers: 20 Interview Questions
Master SGD, Momentum, Nesterov, AdaGrad, RMSprop, Adam, AdamW, Nadam, AdaBelief, Lion, and more. Adaptive learning rates, weight decay, convergence, generalization – concise interview-ready answers.
1
What is an optimizer in deep learning?
⚡ Easy
Answer: An optimizer is an algorithm that updates model parameters (weights) to minimize the loss function. It implements a variant of gradient descent, controlling learning rate, momentum, and adaptive per-parameter updates.
θ_{t+1} = θ_t - η · ∇L(θ_t) (SGD)
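A minimal one-parameter sketch of this update rule (plain Python; the function name and toy objective are illustrative, not from any library):

```python
def sgd_minimize(grad_fn, theta, lr=0.1, steps=100):
    """Vanilla SGD: repeatedly apply theta <- theta - lr * grad."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Minimizing f(x) = x^2 (gradient 2x) drives theta toward 0
x = sgd_minimize(lambda t: 2 * t, theta=5.0)
```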
2
Vanilla SGD: pros and cons?
📊 Medium
Answer: Pros: Simple, memory efficient, generalizes well. Cons: Slow convergence, sensitive to learning rate, oscillations in ravines, struggles with sparse data.
Pros: generalizes, low memory
Cons: slow, plateaus, oscillates
3
How does Momentum optimizer work?
📊 Medium
Answer: Accumulates a velocity vector in the direction of persistent gradients, accelerating convergence and damping oscillations. Classic form: v_t = γ·v_{t-1} + η·∇L; θ = θ - v_t.
Common EMA form: v_t = β·v_{t-1} + (1-β)·∇L; θ = θ - η·v_t
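A sketch of the EMA formulation for a single parameter (illustrative names, not a library API):

```python
def momentum_minimize(grad_fn, theta, lr=0.1, beta=0.9, steps=300):
    """Momentum SGD (EMA form): v is an exponential moving average of gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_fn(theta)  # accumulate persistent direction
        theta = theta - lr * v
    return theta
```

On an ill-conditioned objective, the averaging cancels gradient components that flip sign each step while persistent components accumulate.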
4
NAG vs standard Momentum: difference?
🔥 Hard
Answer: NAG computes the gradient at the "lookahead" position (θ - β·v) instead of at θ. This gives a more accurate update and reduces oscillations; often converges faster.
# NAG pseudo-code: gradient evaluated at the lookahead point
v = β·v + η·∇L(θ - β·v)
θ = θ - v
5
Explain AdaGrad. When is it useful?
📊 Medium
Answer: AdaGrad adapts the learning rate per parameter: it scales inversely with the square root of the running sum of squared gradients. Good for sparse data (e.g., embeddings, NLP). Major drawback: the effective learning rate decays monotonically toward zero, eventually stalling training.
G_t = G_{t-1} + (∇L)²; θ_{t+1} = θ_t - η/(√G_t+ε) · ∇L
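A single-parameter sketch of the formula above (illustrative, not a library implementation):

```python
import math

def adagrad_step(theta, grad, G, lr=0.5, eps=1e-8):
    """One AdaGrad update for a single parameter.

    G accumulates squared gradients forever, so the effective
    learning rate lr / (sqrt(G) + eps) only shrinks over time.
    """
    G = G + grad ** 2
    theta = theta - lr * grad / (math.sqrt(G) + eps)
    return theta, G
```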
6
How does RMSprop improve AdaGrad?
📊 Medium
Answer: RMSprop replaces AdaGrad's cumulative sum with an exponential moving average of squared gradients, so the effective learning rate no longer decays to zero. E[g²]_t = β·E[g²]_{t-1} + (1-β)·(∇L)²; step = η/(√(E[g²]_t)+ε)·∇L.
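The one-line difference from AdaGrad, as a single-parameter sketch (illustrative names):

```python
import math

def rmsprop_step(theta, grad, avg_sq, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update: EMA of squared gradients instead of a cumulative sum."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2  # forgets old gradients
    theta = theta - lr * grad / (math.sqrt(avg_sq) + eps)
    return theta, avg_sq
```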
7
Describe Adam optimizer. Key components?
🔥 Hard
Answer: Adam = RMSprop + Momentum. Maintains first moment (mean) and second moment (uncentered variance) of gradients. Bias correction for initial steps. Default β1=0.9, β2=0.999, ε=1e-8. Popular due to fast convergence and robustness.
m_t = β1·m_{t-1} + (1-β1)·∇L; v_t = β2·v_{t-1} + (1-β2)·(∇L)²
m̂ = m/(1-β1^t); v̂ = v/(1-β2^t); θ = θ - η·m̂/(√v̂+ε)
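The two moments and bias correction above, sketched for a single parameter (illustrative, not a framework API):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad           # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad ** 2      # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction for zero-initialized m
    v_hat = v / (1 - b2 ** t)              # bias correction for zero-initialized v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that thanks to bias correction, the very first step has magnitude ≈ η regardless of the gradient's scale.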
8
AdamW vs Adam: what's the difference?
🔥 Hard
Answer: AdamW decouples weight decay from gradient updates. In Adam, L2 regularization is added to loss; AdamW directly subtracts weight decay from parameters. Leads to better generalization, widely used in Transformers (BERT, ViT).
Adam + L2: g ← ∇L + λθ, then the usual update (decay gets rescaled by 1/(√v̂+ε)) | AdamW: θ = θ - η·(m̂/(√v̂+ε) + λθ)
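A single-parameter sketch of the decoupled decay (illustrative names; defaults are assumptions, not BERT/ViT settings):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update: weight decay is applied directly to theta,
    outside the adaptive rescaling by 1/(sqrt(v_hat) + eps)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # decay term wd * theta is NOT divided by sqrt(v_hat)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With a zero gradient the parameter still shrinks by η·λ·θ per step: the decay is independent of the gradient statistics, unlike Adam + L2.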
9
What is Nadam? Advantage?
🔥 Hard
Answer: Nadam = Adam + Nesterov momentum. It applies Nesterov lookahead on top of Adam's momentum. Sometimes converges slightly faster than Adam.
10
AdaBelief – how is it different from Adam?
🔥 Hard
Answer: AdaBelief modifies Adam: second moment v_t = β2·v_{t-1} + (1-β2)·(∇L - m_t)². Stepsize is η/(√v̂+ε)·m̂. Intuition: adapts to "belief" in observed gradient direction. More stable, often better generalization.
11
What is Lion optimizer? Key idea?
🔥 Hard
Answer: Lion (Evolved Sign Momentum) updates with the sign of an interpolation of momentum and gradient: θ = θ - η·sign(β1·m_{t-1} + (1-β1)·∇L); the momentum itself is tracked with a separate β2: m_t = β2·m_{t-1} + (1-β2)·∇L. Stores only one moment, so it is memory efficient; outperforms AdamW in some large-scale tasks.
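A single-parameter sketch of the Lion update (illustrative names; defaults are placeholder values, not the paper's tuned settings):

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def lion_step(theta, grad, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """One Lion update: the step direction is only a sign, so every
    parameter moves by exactly lr (plus optional decoupled decay)."""
    update = sign(b1 * m + (1 - b1) * grad)   # interpolate, then take the sign
    theta = theta - lr * (update + wd * theta)
    m = b2 * m + (1 - b2) * grad              # momentum tracked with a separate beta
    return theta, m
```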
12
Common learning rate schedules? When to use?
📊 Medium
Answer: Step decay (reduce by factor every few epochs), exponential decay, cosine annealing, linear warmup. Warmup helps Adam in early training (prevents large variance). Cosine decay popular in Transformers.
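The warmup + cosine combination can be sketched as a single function (illustrative; the step counts and base LR are assumptions):

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```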
13
Why does SGD generalize better than Adam?
🔥 Hard
Answer: Hypothesis: Adam may converge to sharper minima, while SGD finds flatter minima (better generalization). Also, adaptive methods have implicit regularization differences. However, AdamW with decoupled weight decay narrows the gap.
14
What is gradient clipping? Which optimizers need it?
📊 Medium
Answer: Clipping limits gradient magnitude to avoid exploding gradients (RNNs, Transformers). Applied element-wise (clip by value) or to the whole gradient vector (clip by global norm). Essential for LSTMs, and also standard with Adam in large Transformer training.
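Global-norm clipping, sketched over a flat list of gradient values (illustrative, not a framework API):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient list so its global L2 norm <= max_norm.

    Unlike per-value clamping, this preserves the gradient's direction.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads
```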
15
Why use learning rate warmup with Adam?
🔥 Hard
Answer: In early steps, Adam's second moment (v) is small, causing large effective LR. Warmup gradually increases LR, stabilizing training. Critical for large-scale Transformer training (BERT, GPT).
16
What is AdaMax? Relation to Adam?
📊 Medium
Answer: AdaMax replaces L2 norm in Adam with L-infinity norm. v_t = max(β2·v_{t-1}, |∇L|). More stable for some problems, less common.
17
AMSGrad – what problem does it solve?
🔥 Hard
Answer: Adam's effective step size can sometimes increase (when v decreases), which breaks its convergence proof. AMSGrad keeps a running maximum of the second moment: v̂_t = max(v̂_{t-1}, v_t), guaranteeing a non-increasing per-parameter effective step size. Marginal improvement in practice.
18
Best optimizer for sparse features (embeddings)?
📊 Medium
Answer: AdaGrad, RMSprop, or Adam with sparse updates (lazy Adam). Sparse gradients benefit from per-parameter adaptive LR.
19
Why not use second-order optimizers (L-BFGS) in deep learning?
🔥 Hard
Answer: The Hessian is huge (billions of parameters), so exact second-order methods are infeasible. Approximations like L-BFGS are expensive, typically need large or full batches, and cope poorly with noisy minibatch gradients. Mostly used in full-batch or convex settings; K-FAC is a rare practical curvature approximation for deep nets.
20
Heuristic: which optimizer to choose?
⚡ Easy
Answer: Default: AdamW with cosine decay + warmup (Transformers, CNNs). For NLP/Transformers: AdamW. For CV: SGD with momentum (generalizes well) or AdamW. For sparse embeddings: Adam/AdaGrad. For memory-limited: SGD or Lion.
AdamW (SOTA), SGD (strong baseline), Lion (emerging)
Optimizers – Interview Cheat Sheet
Non-adaptive
- SGD – simple, generalizes well
- Momentum – accelerates, reduces oscillations
- NAG – lookahead, smarter step
Adaptive (per-param LR)
- AdaGrad – sparse data, LR decays
- RMSprop – non-decaying EMA
- Adam – Momentum + RMSprop
- AdamW – decoupled WD (SOTA)
- Nadam – Adam + NAG
Recent & Niche
- AdaBelief – belief in gradient
- Lion – sign-based, memory-efficient
- AMSGrad – monotonic step size
Tricks
- Warmup + Adam(W) for Transformers
- Cosine decay – smooth annealing
- Gradient clipping – RNN, stability
Default choice
- AdamW + cosine decay + warmup
Verdict: "AdamW for modern architectures; SGD for strong baselines; know your adaptive history!"
20 optimizer Q/A covered