Deep Learning Optimizers: 20 Interview Questions
Master SGD, Momentum, Nesterov, AdaGrad, RMSprop, Adam, AdamW, Nadam, AdaBelief, Lion, and more. Adaptive learning rates, weight decay, convergence, generalization – concise interview-ready answers.
1
What is an optimizer in deep learning?
⚡ Easy
Answer: An optimizer is an algorithm that updates model parameters (weights) to minimize the loss function. It implements a variant of gradient descent, controlling learning rate, momentum, and adaptive per-parameter updates.
θ_{t+1} = θ_t - η · ∇L(θ_t) (SGD)
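A minimal one-parameter sketch of this update rule (plain Python; the function name and toy objective are illustrative, not from any library):

```python
def sgd_minimize(grad_fn, theta, lr=0.1, steps=100):
    """Vanilla SGD: repeatedly apply theta <- theta - lr * grad."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Minimizing f(x) = x^2 (gradient 2x) drives theta toward 0
x = sgd_minimize(lambda t: 2 * t, theta=5.0)
```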
2
Vanilla SGD: pros and cons?
📊 Medium
Answer: Pros: Simple, memory efficient, generalizes well. Cons: Slow convergence, sensitive to learning rate, oscillations in ravines, struggles with sparse data.
Pros: generalizes, low memory
Cons: slow, plateaus, oscillates
3
How does Momentum optimizer work?
📊 Medium
Answer: Accumulates a velocity vector in the direction of persistent gradients, accelerating convergence and damping oscillations. Classic form: v_t = γ·v_{t-1} + η·∇L; θ = θ - v_t.
Common EMA form: v_t = β·v_{t-1} + (1-β)·∇L; θ = θ - η·v_t
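A sketch of the EMA formulation for a single parameter (illustrative names, not a library API):

```python
def momentum_minimize(grad_fn, theta, lr=0.1, beta=0.9, steps=300):
    """Momentum SGD (EMA form): v is an exponential moving average of gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_fn(theta)  # accumulate persistent direction
        theta = theta - lr * v
    return theta
```

On an ill-conditioned objective, the averaging cancels gradient components that flip sign each step while persistent components accumulate.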
4
NAG vs standard Momentum: difference?
🔥 Hard
Answer: NAG computes the gradient at the "lookahead" position (θ - β·v) instead of at θ. This gives a more accurate update and reduces oscillations; often converges faster.
# NAG pseudo-code: gradient evaluated at the lookahead point
v = β·v + η·∇L(θ - β·v)
θ = θ - v
5
Explain AdaGrad. When is it useful?
📊 Medium
Answer: AdaGrad adapts the learning rate per parameter: it scales inversely with the square root of the running sum of squared gradients. Good for sparse data (e.g., embeddings, NLP). Major drawback: the effective learning rate decays monotonically toward zero, eventually stalling training.
G_t = G_{t-1} + (∇L)²; θ_{t+1} = θ_t - η/(√G_t+ε) · ∇L
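A single-parameter sketch of the formula above (illustrative, not a library implementation):

```python
import math

def adagrad_step(theta, grad, G, lr=0.5, eps=1e-8):
    """One AdaGrad update for a single parameter.

    G accumulates squared gradients forever, so the effective
    learning rate lr / (sqrt(G) + eps) only shrinks over time.
    """
    G = G + grad ** 2
    theta = theta - lr * grad / (math.sqrt(G) + eps)
    return theta, G
```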
6
How does RMSprop improve AdaGrad?
📊 Medium
Answer: RMSprop replaces AdaGrad's cumulative sum with an exponential moving average of squared gradients, so the effective learning rate no longer decays to zero. E[g²]_t = β·E[g²]_{t-1} + (1-β)·(∇L)²; step = η/(√(E[g²]_t)+ε)·∇L.
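The one-line difference from AdaGrad, as a single-parameter sketch (illustrative names):

```python
import math

def rmsprop_step(theta, grad, avg_sq, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update: EMA of squared gradients instead of a cumulative sum."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2  # forgets old gradients
    theta = theta - lr * grad / (math.sqrt(avg_sq) + eps)
    return theta, avg_sq
```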
7
Describe Adam optimizer. Key components?
🔥 Hard
Answer: Adam = RMSprop + Momentum. Maintains first moment (mean) and second moment (uncentered variance) of gradients. Bias correction for initial steps. Default β1=0.9, β2=0.999, ε=1e-8. Popular due to fast convergence and robustness.
m_t = β1·m_{t-1} + (1-β1)·∇L; v_t = β2·v_{t-1} + (1-β2)·(∇L)²
m̂ = m/(1-β1^t); v̂ = v/(1-β2^t); θ = θ - η·m̂/(√v̂+ε)
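The two moments and bias correction above, sketched for a single parameter (illustrative, not a framework API):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad           # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad ** 2      # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction for zero-initialized m
    v_hat = v / (1 - b2 ** t)              # bias correction for zero-initialized v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that thanks to bias correction, the very first step has magnitude ≈ η regardless of the gradient's scale.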
8
AdamW vs Adam: what's the difference?
🔥 Hard
Answer: AdamW decouples weight decay from gradient updates. In Adam, L2 regularization is added to loss; AdamW directly subtracts weight decay from parameters. Leads to better generalization, widely used in Transformers (BERT, ViT).
Adam + L2: g ← ∇L + λθ, then the usual update (decay gets rescaled by 1/(√v̂+ε)) | AdamW: θ = θ - η·(m̂/(√v̂+ε) + λθ)
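A single-parameter sketch of the decoupled decay (illustrative names; defaults are assumptions, not BERT/ViT settings):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update: weight decay is applied directly to theta,
    outside the adaptive rescaling by 1/(sqrt(v_hat) + eps)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # decay term wd * theta is NOT divided by sqrt(v_hat)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With a zero gradient the parameter still shrinks by η·λ·θ per step: the decay is independent of the gradient statistics, unlike Adam + L2.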
9
What is Nadam? Advantage?
🔥 Hard
Answer: Nadam = Adam + Nesterov momentum. It applies Nesterov lookahead on top of Adam's momentum. Sometimes converges slightly faster than Adam.
10
AdaBelief – how is it different from Adam?
🔥 Hard
Answer: AdaBelief modifies Adam: second moment v_t = β2·v_{t-1} + (1-β2)·(∇L - m_t)². Stepsize is η/(√v̂+ε)·m̂. Intuition: adapts to "belief" in observed gradient direction. More stable, often better generalization.
11
What is Lion optimizer? Key idea?
🔥 Hard
Answer: Lion (Evolved Sign Momentum) updates with the sign of an interpolation of momentum and gradient: θ = θ - η·sign(β1·m_{t-1} + (1-β1)·∇L); the momentum itself is tracked with a separate β2: m_t = β2·m_{t-1} + (1-β2)·∇L. Stores only one moment, so it is memory efficient; outperforms AdamW in some large-scale tasks.
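A single-parameter sketch of the Lion update (illustrative names; defaults are placeholder values, not the paper's tuned settings):

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def lion_step(theta, grad, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """One Lion update: the step direction is only a sign, so every
    parameter moves by exactly lr (plus optional decoupled decay)."""
    update = sign(b1 * m + (1 - b1) * grad)   # interpolate, then take the sign
    theta = theta - lr * (update + wd * theta)
    m = b2 * m + (1 - b2) * grad              # momentum tracked with a separate beta
    return theta, m
```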
12
Common learning rate schedules? When to use?
📊 Medium
Answer: Step decay (reduce by factor every few epochs), exponential decay, cosine annealing, linear warmup. Warmup helps Adam in early training (prevents large variance). Cosine decay popular in Transformers.
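The warmup + cosine combination can be sketched as a single function (illustrative; the step counts and base LR are assumptions):

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```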
13
Why does SGD generalize better than Adam?
🔥 Hard
Answer: Hypothesis: Adam may converge to sharper minima, while SGD finds flatter minima (better generalization). Also, adaptive methods have implicit regularization differences. However, AdamW with decoupled weight decay narrows the gap.
14
What is gradient clipping? Which optimizers need it?
📊 Medium
Answer: Clipping limits gradient magnitude to avoid exploding gradients (RNNs, Transformers). Applied element-wise (clip by value) or to the whole gradient vector (clip by global norm). Essential for LSTMs, and also standard with Adam in large Transformer training.
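Global-norm clipping, sketched over a flat list of gradient values (illustrative, not a framework API):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient list so its global L2 norm <= max_norm.

    Unlike per-value clamping, this preserves the gradient's direction.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads
```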
15
Why use learning rate warmup with Adam?
🔥 Hard
Answer: In early steps, Adam's second moment (v) is small, causing large effective LR. Warmup gradually increases LR, stabilizing training. Critical for large-scale Transformer training (BERT, GPT).
16
What is AdaMax? Relation to Adam?
📊 Medium
Answer: AdaMax replaces L2 norm in Adam with L-infinity norm. v_t = max(β2·v_{t-1}, |∇L|). More stable for some problems, less common.
17
AMSGrad – what problem does it solve?
🔥 Hard
Answer: Adam's effective step size can sometimes increase (when v decreases), which breaks its convergence proof. AMSGrad keeps a running maximum of the second moment: v̂_t = max(v̂_{t-1}, v_t), guaranteeing a non-increasing per-parameter effective step size. Marginal improvement in practice.
18
Best optimizer for sparse features (embeddings)?
📊 Medium
Answer: AdaGrad, RMSprop, or Adam with sparse updates (lazy Adam). Sparse gradients benefit from per-parameter adaptive LR.
19
Why not use second-order optimizers (L-BFGS) in deep learning?
🔥 Hard
Answer: The Hessian is huge (billions of parameters), so exact second-order methods are infeasible. Approximations like L-BFGS are expensive, typically need large or full batches, and cope poorly with noisy minibatch gradients. Mostly used in full-batch or convex settings; K-FAC is a rare practical curvature approximation for deep nets.
20
Heuristic: which optimizer to choose?
⚡ Easy
Answer: Default: AdamW with cosine decay + warmup (Transformers, CNNs). For NLP/Transformers: AdamW. For CV: SGD with momentum (generalizes well) or AdamW. For sparse embeddings: Adam/AdaGrad. For memory-limited: SGD or Lion.
AdamW (SOTA), SGD (strong baseline), Lion (emerging)
Optimizers – Interview Cheat Sheet
Non-adaptive
- SGD – simple, generalizes well
- Momentum – accelerates, reduces oscillations
- NAG – lookahead, smarter step
Adaptive (per-param LR)
- AdaGrad – sparse data, LR decays
- RMSprop – non-decaying EMA
- Adam – Momentum + RMSprop
- AdamW – decoupled WD (SOTA)
- Nadam – Adam + NAG
Recent & Niche
- AdaBelief – belief in gradient
- Lion – sign-based, memory-efficient
- AMSGrad – monotonic step size
Tricks
- Warmup + Adam(W) for Transformers
- Cosine decay – smooth annealing
- Gradient clipping – RNN, stability
Default choice
- AdamW + cosine decay + warmup
Verdict: "AdamW for modern architectures; SGD for strong baselines; know your adaptive history!"
20 optimizer Q/A covered