Neural Networks: 15 Essential Q&A
Interview Prep

Gradient Descent — 15 Interview Questions

First-order optimization: step size, noise from minibatches, escaping plateaus, and how momentum reshapes the update—before backprop supplies the actual gradients.


Topics: Optimization · Learning rate · Mini-batch · Momentum
1 What is (vanilla) gradient descent? (Easy)
Answer: Repeatedly update the parameters θ by stepping opposite the gradient of the loss L, the locally steepest-descent direction, so that a small enough step decreases the loss. Requires a differentiable objective (or subgradients).
θ ← θ − η ∇_θ L(θ)
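A minimal sketch of this update loop, assuming a toy quadratic loss and a hand-coded gradient (both are illustrative, not part of the card):

import numpy as np

def loss(theta):
    return 0.5 * np.sum(theta ** 2)        # toy quadratic loss

def grad(theta):
    return theta                           # its gradient, written by hand

theta = np.array([3.0, -2.0])
eta = 0.1                                  # learning rate η
for _ in range(100):
    theta = theta - eta * grad(theta)      # θ ← θ − η ∇_θ L(θ)
print(theta, loss(theta))                  # both end up near the minimizer at 0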
2 Full-batch vs mini-batch vs stochastic GD. (Easy)
Answer: Full-batch: gradient over all data; accurate but expensive per step. Mini-batch: a noisier but much cheaper estimate from a small random subset; GPU-friendly. In practice "SGD" usually means mini-batch SGD; strict one-sample SGD is very noisy.
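A hedged sketch contrasting the two gradient estimates on a synthetic least-squares problem (the data, model, and batch size of 32 are all made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad(theta, Xb, yb):
    return Xb.T @ (Xb @ theta - yb) / len(yb)   # least-squares gradient on a batch

theta = np.zeros(5)
full = grad(theta, X, y)                         # full-batch: exact, but touches all 1000 rows
idx = rng.choice(len(X), size=32, replace=False)
mini = grad(theta, X[idx], y[idx])               # mini-batch: noisy estimate from 32 rows
print(np.linalg.norm(full - mini))               # the sampling noise in the estimate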
3 What happens if the learning rate is too large or too small? (Easy)
Answer: Too large: oscillation or divergence. Too small: painfully slow progress and risk of stalling in flat regions. Schedules and adaptive methods address this.
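A 1-D quadratic makes both failure modes concrete; the curvature and the two learning rates below are arbitrary choices for illustration:

# f(x) = 0.5 * a * x**2 has gradient a * x, so one GD step multiplies x by (1 - eta * a).
a = 10.0
for eta in (0.25, 0.001):                  # too large vs too small for this curvature
    x = 1.0
    for _ in range(50):
        x = x - eta * a * x
    print(eta, x)                          # 0.25: |1 - 2.5| > 1, oscillates and blows up; 0.001: barely moves toward 0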
4 Local minima vs global minimum: what do we say for deep nets? (Medium)
Answer: Non-convex losses have many local minima and saddle points. In high dimensions saddles are common; “bad” local minima are less universal than folklore suggests, but optimization is still hard.
5 What is a saddle point? (Medium)
Answer: A point where the gradient is zero but that is neither a minimum nor a maximum: curvature is positive in some directions and negative in others. GD can slow down near saddles; minibatch noise or momentum helps escape.
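A concrete example, f(x, y) = x² − y²: the gradient vanishes at the origin while the Hessian has eigenvalues of both signs, which the small check below (an illustrative sketch) confirms:

import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])            # ∇f for f(x, y) = x**2 - y**2

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])   # constant Hessian of f
print(grad(np.zeros(2)))                         # [0. 0.]: the origin is a critical point
print(np.linalg.eigvalsh(hessian))               # one negative, one positive eigenvalue: a saddle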
6 Momentum: intuition. (Medium)
Answer: Accumulate a velocity vector damped by friction so updates continue through noisy or ill-conditioned directions—like a ball rolling in a valley. Helps damp oscillations in narrow valleys.
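A minimal heavy-ball momentum sketch on an ill-conditioned quadratic (a narrow valley); the loss, learning rate, and momentum coefficient are illustrative assumptions:

import numpy as np

def grad(theta):
    return np.array([1.0, 100.0]) * theta     # curvature 1 in one direction, 100 in the other

theta = np.array([1.0, 1.0])
velocity = np.zeros(2)
eta, beta = 0.005, 0.9                        # learning rate and friction/momentum coefficient
for _ in range(200):
    velocity = beta * velocity + grad(theta)  # accumulate a damped velocity
    theta = theta - eta * velocity            # step along the velocity, not the raw gradient
print(theta)                                   # both coordinates shrink toward 0 despite the bad conditioning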
7 When does GD find the global minimum (classically)? (Hard)
Answer: For convex smooth functions with appropriate step sizes, GD converges to a global minimizer. Deep neural network losses are generally not convex—this theorem does not apply directly.
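A sketch of the classical setting: for a convex quadratic that is L-smooth (largest Hessian eigenvalue L), plain GD with step size 1/L contracts to the unique global minimizer; the matrix below is just an example:

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])        # symmetric positive definite: f(x) = 0.5 * x.T @ A @ x is convex
L_smooth = np.linalg.eigvalsh(A).max()         # smoothness constant = largest eigenvalue
eta = 1.0 / L_smooth
x = np.array([5.0, -3.0])
for _ in range(200):
    x = x - eta * (A @ x)                      # gradient of f is A @ x
print(x)                                        # converges to the global minimizer at 0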
8 How does batch size affect the gradient estimate? (Medium)
Answer: Larger batch → lower variance gradient, more stable steps, more memory. Smaller batch → noisier updates that can act like regularization and help generalization (with caveats).
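An empirical check on a synthetic regression problem (everything below is made up for illustration) that the mini-batch gradient's deviation from the full-batch gradient shrinks as the batch grows:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=10_000)
theta = np.zeros(3)
full = X.T @ (X @ theta - y) / len(y)           # full-batch gradient: the quantity being estimated

for batch in (8, 64, 512):
    sq_err = 0.0
    for _ in range(200):
        idx = rng.choice(len(X), size=batch)     # sample a mini-batch (with replacement)
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ theta - yb) / batch
        sq_err += np.sum((g - full) ** 2) / 200
    print(batch, sq_err)                          # mean squared deviation drops roughly like 1/batch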
9 Epoch vs iteration in training loops. (Easy)
Answer: One epoch is a full pass over the training set (possibly shuffled). One iteration is one parameter update (often one mini-batch). Multiple iterations per epoch.
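The bookkeeping in code, with dataset size, batch size, and epoch count chosen arbitrarily:

import math

n_examples, batch_size, n_epochs = 50_000, 128, 10
iters_per_epoch = math.ceil(n_examples / batch_size)   # one iteration = one mini-batch update
total_iterations = iters_per_epoch * n_epochs           # one epoch = one full pass over the data
print(iters_per_epoch, total_iterations)                # 391 updates per epoch, 3910 in total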
10 Online learning: relation to SGD. (Medium)
Answer: Data arrives as a stream, and each example (or small group of examples) triggers an update, a natural fit for stochastic-style updates. It is the same GD template, but the data distribution may drift over time (non-stationarity).
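A streaming sketch: each arriving example updates the model immediately; the data generator, true weights, and learning rate below are invented for the example:

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)
eta = 0.01

def next_example():
    # stand-in for a data stream; in online learning this distribution may drift over time
    x = rng.normal(size=2)
    return x, float(x @ np.array([2.0, -1.0]) + 0.1 * rng.normal())

for _ in range(5_000):
    x, y = next_example()
    g = (x @ theta - y) * x                  # per-example squared-error gradient
    theta = theta - eta * g                  # update now; no dataset is stored
print(theta)                                  # approaches the true weights [2, -1]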
11 What is a plateau in optimization? (Easy)
Answer: A nearly flat region where gradients are tiny—progress slows. Learning-rate warmup, restarts, or second-order hints (in advanced optimizers) can help.
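One of the mitigations mentioned above, learning-rate warmup followed by decay, as a small self-contained schedule; the shape and constants are illustrative, not a recommendation:

def lr_schedule(step, base_lr=0.1, warmup_steps=500, total_steps=10_000):
    # linear warmup, then linear decay toward zero (cosine decay and restarts are common variants)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)

print([round(lr_schedule(s), 4) for s in (0, 250, 500, 5_000, 10_000)])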
12 Can gradient noise help generalization? (Hard)
Answer: Small-batch stochasticity can help escape sharp minima and explore the landscape—linked to flat minima hypotheses. Not a guarantee; interaction with batch norm and scale matters.
14 What is preconditioning at a high level? (Hard)
Answer: Rescaling coordinates so the landscape is more isotropic—Newton-like methods use inverse Hessian; Adam/RMSprop use diagonal adaptive scaling as a cheap approximation.
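A sketch of the cheap diagonal version (RMSprop-style) on an ill-conditioned quadratic; the decay rate and epsilon follow common defaults but are assumptions here, not prescribed by the card:

import numpy as np

def grad(theta):
    return np.array([1.0, 100.0]) * theta     # very different curvature per coordinate

theta = np.array([1.0, 1.0])
second_moment = np.zeros(2)
eta, decay, eps = 0.02, 0.9, 1e-8
for _ in range(300):
    g = grad(theta)
    second_moment = decay * second_moment + (1 - decay) * g ** 2   # running estimate of g²
    theta = theta - eta * g / (np.sqrt(second_moment) + eps)       # per-coordinate rescaling = diagonal preconditioning
print(theta)                                    # both coordinates shrink at a similar rate despite the 100x curvature gap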
15 GD update vs what backprop computes. (Easy)
Answer: Backpropagation computes ∇L; gradient descent uses that vector to update weights. They answer different questions: “which direction?” vs “how to step?”
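The division of labor as a runnable sketch; compute_gradients below is a hypothetical stand-in for backprop (here it just returns the gradient of a toy quadratic so the loop executes), while the optimizer step only consumes whatever gradient it is handed:

import numpy as np

def compute_gradients(params, batch):
    # placeholder for backpropagation: answers "in which direction does the loss increase?"
    return {name: p.copy() for name, p in params.items()}   # gradient of 0.5 * ||params||^2

def sgd_step(params, grads, lr=0.1):
    # gradient descent: answers "how far to step, opposite that direction?"
    return {name: p - lr * grads[name] for name, p in params.items()}

params = {"w": np.array([1.0, -2.0]), "b": np.array([0.5])}
for _ in range(50):
    grads = compute_gradients(params, batch=None)
    params = sgd_step(params, grads)
print(params)                                   # all parameters decay toward 0 under this toy loss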
Keep GD answers separate from chain rule mechanics—interviewers often ask both in sequence.

Quick review checklist

  • Update rule; batch vs mini-batch; learning rate effects.
  • Local minima vs saddles; momentum intuition; batch size vs noise.
  • GD vs backprop roles; epoch vs iteration.