Neural Networks: 15 Essential Q&A
Interview Prep

Gradient Descent — 15 Interview Questions

First-order optimization: step size, noise from minibatches, escaping plateaus, and how momentum reshapes the update—before backprop supplies the actual gradients.


Topics: Optimization · Learning rate · Mini-batch · Momentum
1 What is (vanilla) gradient descent? (Easy)
Answer: Repeatedly update the parameters θ by stepping opposite the gradient of the loss L, the locally steepest-descent direction, so that a small enough step decreases the loss. Requires a differentiable objective (or subgradients).
θ ← θ − η ∇_θ L(θ)
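A minimal sketch of this update loop, assuming a toy quadratic loss and a hand-coded gradient (both are illustrative, not part of the card):

import numpy as np

def loss(theta):
    return 0.5 * np.sum(theta ** 2)        # toy quadratic loss

def grad(theta):
    return theta                           # its gradient, written by hand

theta = np.array([3.0, -2.0])
eta = 0.1                                  # learning rate η
for _ in range(100):
    theta = theta - eta * grad(theta)      # θ ← θ − η ∇_θ L(θ)
print(theta, loss(theta))                  # both end up near the minimizer at 0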
2 Full-batch vs mini-batch vs stochastic GD. (Easy)
Answer: Full-batch: gradient over all data; accurate but expensive per step. Mini-batch: a noisier but much cheaper estimate from a small random subset; GPU-friendly. In practice "SGD" usually means mini-batch SGD; strict one-sample SGD is very noisy.
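A hedged sketch contrasting the two gradient estimates on a synthetic least-squares problem (the data, model, and batch size of 32 are all made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad(theta, Xb, yb):
    return Xb.T @ (Xb @ theta - yb) / len(yb)   # least-squares gradient on a batch

theta = np.zeros(5)
full = grad(theta, X, y)                         # full-batch: exact, but touches all 1000 rows
idx = rng.choice(len(X), size=32, replace=False)
mini = grad(theta, X[idx], y[idx])               # mini-batch: noisy estimate from 32 rows
print(np.linalg.norm(full - mini))               # the sampling noise in the estimate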
3 What happens if the learning rate is too large or too small? (Easy)
Answer: Too large: oscillation or divergence. Too small: painfully slow progress and risk of stalling in flat regions. Schedules and adaptive methods address this.
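A 1-D quadratic makes both failure modes concrete; the curvature and the two learning rates below are arbitrary choices for illustration:

# f(x) = 0.5 * a * x**2 has gradient a * x, so one GD step multiplies x by (1 - eta * a).
a = 10.0
for eta in (0.25, 0.001):                  # too large vs too small for this curvature
    x = 1.0
    for _ in range(50):
        x = x - eta * a * x
    print(eta, x)                          # 0.25: |1 - 2.5| > 1, oscillates and blows up; 0.001: barely moves toward 0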
4 Local minima vs global minimum: what do we say for deep nets? (Medium)
Answer: Non-convex losses have many local minima and saddle points. In high dimensions saddles are common; “bad” local minima are less universal than folklore suggests, but optimization is still hard.
5 What is a saddle point? (Medium)
Answer: A point where the gradient is zero but that is neither a minimum nor a maximum: curvature is positive in some directions and negative in others. GD can slow down near saddles; minibatch noise or momentum helps escape.
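A concrete example, f(x, y) = x² − y²: the gradient vanishes at the origin while the Hessian has eigenvalues of both signs, which the small check below (an illustrative sketch) confirms:

import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])            # ∇f for f(x, y) = x**2 - y**2

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])   # constant Hessian of f
print(grad(np.zeros(2)))                         # [0. 0.]: the origin is a critical point
print(np.linalg.eigvalsh(hessian))               # one negative, one positive eigenvalue: a saddle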
6 Momentum: intuition. (Medium)
Answer: Accumulate a velocity vector damped by friction so updates continue through noisy or ill-conditioned directions—like a ball rolling in a valley. Helps damp oscillations in narrow valleys.
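A minimal heavy-ball momentum sketch on an ill-conditioned quadratic (a narrow valley); the loss, learning rate, and momentum coefficient are illustrative assumptions:

import numpy as np

def grad(theta):
    return np.array([1.0, 100.0]) * theta     # curvature 1 in one direction, 100 in the other

theta = np.array([1.0, 1.0])
velocity = np.zeros(2)
eta, beta = 0.005, 0.9                        # learning rate and friction/momentum coefficient
for _ in range(200):
    velocity = beta * velocity + grad(theta)  # accumulate a damped velocity
    theta = theta - eta * velocity            # step along the velocity, not the raw gradient
print(theta)                                   # both coordinates shrink toward 0 despite the bad conditioning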
7 When does GD find the global minimum (classically)? (Hard)
Answer: For convex smooth functions with appropriate step sizes, GD converges to a global minimizer. Deep neural network losses are generally not convex—this theorem does not apply directly.
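A sketch of the classical setting: for a convex quadratic that is L-smooth (largest Hessian eigenvalue L), plain GD with step size 1/L contracts to the unique global minimizer; the matrix below is just an example:

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])        # symmetric positive definite: f(x) = 0.5 * x.T @ A @ x is convex
L_smooth = np.linalg.eigvalsh(A).max()         # smoothness constant = largest eigenvalue
eta = 1.0 / L_smooth
x = np.array([5.0, -3.0])
for _ in range(200):
    x = x - eta * (A @ x)                      # gradient of f is A @ x
print(x)                                        # converges to the global minimizer at 0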
8 How does batch size affect the gradient estimate? (Medium)
Answer: Larger batch → lower variance gradient, more stable steps, more memory. Smaller batch → noisier updates that can act like regularization and help generalization (with caveats).
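An empirical check on a synthetic regression problem (everything below is made up for illustration) that the mini-batch gradient's deviation from the full-batch gradient shrinks as the batch grows:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=10_000)
theta = np.zeros(3)
full = X.T @ (X @ theta - y) / len(y)           # full-batch gradient: the quantity being estimated

for batch in (8, 64, 512):
    sq_err = 0.0
    for _ in range(200):
        idx = rng.choice(len(X), size=batch)     # sample a mini-batch (with replacement)
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ theta - yb) / batch
        sq_err += np.sum((g - full) ** 2) / 200
    print(batch, sq_err)                          # mean squared deviation drops roughly like 1/batch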
9 Epoch vs iteration in training loops. (Easy)
Answer: One epoch is a full pass over the training set (possibly shuffled). One iteration is one parameter update (often one mini-batch). Multiple iterations per epoch.
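The bookkeeping in code, with dataset size, batch size, and epoch count chosen arbitrarily:

import math

n_examples, batch_size, n_epochs = 50_000, 128, 10
iters_per_epoch = math.ceil(n_examples / batch_size)   # one iteration = one mini-batch update
total_iterations = iters_per_epoch * n_epochs           # one epoch = one full pass over the data
print(iters_per_epoch, total_iterations)                # 391 updates per epoch, 3910 in total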
10 Online learning: relation to SGD. (Medium)
Answer: Data arrives as a stream, and each example (or small group of examples) triggers an update, a natural fit for stochastic-style updates. It is the same GD template, but the data distribution may drift over time (non-stationarity).
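A streaming sketch: each arriving example updates the model immediately; the data generator, true weights, and learning rate below are invented for the example:

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)
eta = 0.01

def next_example():
    # stand-in for a data stream; in online learning this distribution may drift over time
    x = rng.normal(size=2)
    return x, float(x @ np.array([2.0, -1.0]) + 0.1 * rng.normal())

for _ in range(5_000):
    x, y = next_example()
    g = (x @ theta - y) * x                  # per-example squared-error gradient
    theta = theta - eta * g                  # update now; no dataset is stored
print(theta)                                  # approaches the true weights [2, -1]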
11 What is a plateau in optimization? (Easy)
Answer: A nearly flat region where gradients are tiny—progress slows. Learning-rate warmup, restarts, or second-order hints (in advanced optimizers) can help.
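One of the mitigations mentioned above, learning-rate warmup followed by decay, as a small self-contained schedule; the shape and constants are illustrative, not a recommendation:

def lr_schedule(step, base_lr=0.1, warmup_steps=500, total_steps=10_000):
    # linear warmup, then linear decay toward zero (cosine decay and restarts are common variants)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)

print([round(lr_schedule(s), 4) for s in (0, 250, 500, 5_000, 10_000)])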
12 Can gradient noise help generalization? (Hard)
Answer: Small-batch stochasticity can help escape sharp minima and explore the landscape—linked to flat minima hypotheses. Not a guarantee; interaction with batch norm and scale matters.
14 What is preconditioning at a high level? (Hard)
Answer: Rescaling coordinates so the landscape is more isotropic—Newton-like methods use inverse Hessian; Adam/RMSprop use diagonal adaptive scaling as a cheap approximation.
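A sketch of the cheap diagonal version (RMSprop-style) on an ill-conditioned quadratic; the decay rate and epsilon follow common defaults but are assumptions here, not prescribed by the card:

import numpy as np

def grad(theta):
    return np.array([1.0, 100.0]) * theta     # very different curvature per coordinate

theta = np.array([1.0, 1.0])
second_moment = np.zeros(2)
eta, decay, eps = 0.02, 0.9, 1e-8
for _ in range(300):
    g = grad(theta)
    second_moment = decay * second_moment + (1 - decay) * g ** 2   # running estimate of g²
    theta = theta - eta * g / (np.sqrt(second_moment) + eps)       # per-coordinate rescaling = diagonal preconditioning
print(theta)                                    # both coordinates shrink at a similar rate despite the 100x curvature gap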
15 GD update vs what backprop computes. (Easy)
Answer: Backpropagation computes ∇L; gradient descent uses that vector to update weights. They answer different questions: “which direction?” vs “how to step?”
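The division of labor as a runnable sketch; compute_gradients below is a hypothetical stand-in for backprop (here it just returns the gradient of a toy quadratic so the loop executes), while the optimizer step only consumes whatever gradient it is handed:

import numpy as np

def compute_gradients(params, batch):
    # placeholder for backpropagation: answers "in which direction does the loss increase?"
    return {name: p.copy() for name, p in params.items()}   # gradient of 0.5 * ||params||^2

def sgd_step(params, grads, lr=0.1):
    # gradient descent: answers "how far to step, opposite that direction?"
    return {name: p - lr * grads[name] for name, p in params.items()}

params = {"w": np.array([1.0, -2.0]), "b": np.array([0.5])}
for _ in range(50):
    grads = compute_gradients(params, batch=None)
    params = sgd_step(params, grads)
print(params)                                   # all parameters decay toward 0 under this toy loss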
Keep GD answers separate from chain rule mechanics—interviewers often ask both in sequence.

Quick review checklist

  • Update rule; batch vs mini-batch; learning rate effects.
  • Local minima vs saddles; momentum intuition; batch size vs noise.
  • GD vs backprop roles; epoch vs iteration.