Gradient Descent — 15 Interview Questions
First-order optimization: step size, noise from minibatches, escaping plateaus, and how momentum reshapes the update—before backprop supplies the actual gradients.
Tags: Optimization, Learning rate, Mini-batch, Momentum
1. What is (vanilla) gradient descent? (Easy)
Answer: Repeatedly update the parameters θ by stepping opposite the gradient of the loss L; the negative gradient is the locally steepest-descent direction, so a small enough step reduces the loss. Requires a differentiable objective (or subgradients).
θ ← θ − η ∇_θ L(θ)
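A minimal NumPy sketch of this update on a toy quadratic loss (the loss, its gradient, and all constants are illustrative assumptions, not a recommended setup):

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * ||theta - target||^2 (illustrative)
target = np.array([3.0, -2.0])

def grad_L(theta):
    return theta - target              # gradient of the toy loss

theta = np.zeros(2)                    # initial parameters
eta = 0.1                              # learning rate

for _ in range(100):
    theta = theta - eta * grad_L(theta)   # theta <- theta - eta * grad L(theta)

print(theta)                           # approaches [3.0, -2.0]
```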
2. Full-batch vs mini-batch vs stochastic GD. (Easy)
Answer: Full-batch: gradient over all data—accurate but expensive. Mini-batch: noisy estimate, GPU-friendly. SGD often means mini-batch in practice; strict one-sample SGD is very noisy.
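A sketch contrasting the three gradient estimates on a made-up least-squares problem (the data, model, and batch size of 32 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                         # toy inputs
y = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)

def grad(w, Xb, yb):
    # gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

g_full = grad(w, X, y)                                 # full-batch: exact but touches all data
idx = rng.choice(len(X), size=32, replace=False)
g_mini = grad(w, X[idx], y[idx])                       # mini-batch: noisy, cheap, GPU-friendly
i = rng.integers(len(X))
g_sgd = grad(w, X[i:i+1], y[i:i+1])                    # strict SGD: one sample, very noisy
```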
3. What happens if the learning rate is too large or too small? (Easy)
Answer: Too large: oscillation or divergence. Too small: painfully slow progress and risk of stalling in flat regions. Schedules and adaptive methods address this.
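A tiny experiment on a 1-D quadratic (toy values, purely illustrative) showing divergence, crawling, and healthy convergence:

```python
# Toy 1-D quadratic L(theta) = 0.5 * a * theta^2, gradient a * theta
a = 10.0

def run(eta, steps=50):
    theta = 1.0
    for _ in range(steps):
        theta -= eta * a * theta
    return theta

print(run(0.25))   # eta > 2/a = 0.2, so |1 - eta*a| > 1: iterates blow up
print(run(0.001))  # tiny eta: still far from 0 after 50 steps (slow)
print(run(0.1))    # well-chosen eta: converges to ~0 quickly
```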
4. Local minima vs global minimum—what do we say for deep nets? (Medium)
Answer: Non-convex losses have many local minima and saddle points. In high dimensions saddles are common; "bad" local minima are less universal than folklore suggests, but optimization is still hard.
5. What is a saddle point? (Medium)
Answer: A point where the gradient is zero but it is neither minimum nor maximum—curvature positive in some directions, negative in others. GD can slow down near saddles; noise (minibatches) or momentum helps escape.
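A standard worked example, chosen for illustration: f(x, y) = x² − y² has zero gradient at the origin yet curves up along x and down along y.

```python
import numpy as np

def f(p):                           # f(x, y) = x^2 - y^2
    x, y = p
    return x**2 - y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, -2 * y])

origin = np.array([0.0, 0.0])
print(f(origin), grad_f(origin))    # 0.0 and [0, 0]: the gradient vanishes
# Hessian is diag(2, -2): curvature is positive along x, negative along y,
# so the origin is a saddle, not a minimum or maximum.
```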
6. Momentum—intuition. (Medium)
Answer: Accumulate a velocity vector with friction (exponential decay) so updates keep moving through noisy or ill-conditioned directions, like a heavy ball rolling down a valley. This damps oscillations across narrow valleys while keeping speed along the consistent descent direction.
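A minimal sketch of the classical heavy-ball momentum update on the same kind of toy quadratic as above (the η and β values are assumptions):

```python
import numpy as np

def grad_L(theta):                        # toy quadratic gradient (assumption)
    return theta - np.array([3.0, -2.0])

theta = np.zeros(2)
v = np.zeros_like(theta)                  # velocity
eta, beta = 0.1, 0.9                      # learning rate, friction/decay

for _ in range(200):
    v = beta * v - eta * grad_L(theta)    # accumulate velocity
    theta = theta + v                     # step along the velocity

print(theta)                              # approaches [3.0, -2.0]
```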
7. When does GD find the global minimum (classically)? (Hard)
Answer: For convex smooth functions with appropriate step sizes, GD converges to a global minimizer. Deep neural network losses are generally not convex—this theorem does not apply directly.
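For context, one standard textbook statement of the classical guarantee (notation is mine: f convex and β-smooth, fixed step η = 1/β, minimizer θ*):
f(θ_k) − f(θ*) ≤ β ‖θ_0 − θ*‖² / (2k)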
8. How does batch size affect the gradient estimate? (Medium)
Answer: Larger batch → lower variance gradient, more stable steps, more memory. Smaller batch → noisier updates that can act like regularization and help generalization (with caveats).
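A quick empirical check on toy regression data (purely illustrative) that the gradient's variance falls roughly like 1 / batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=2000)
w = np.zeros(3)

def minibatch_grad(batch_size):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

for b in (1, 16, 256):
    grads = np.stack([minibatch_grad(b) for _ in range(500)])
    print(b, grads.var(axis=0).mean())    # variance drops roughly like 1/b
```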
9. Epoch vs iteration in training loops. (Easy)
Answer: One epoch is a full pass over the training set (possibly shuffled). One iteration is one parameter update (often one mini-batch). Multiple iterations per epoch.
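The bookkeeping, with made-up sizes:

```python
import math

num_examples = 50_000                                      # training-set size (assumption)
batch_size = 128
iters_per_epoch = math.ceil(num_examples / batch_size)     # ~391 updates per epoch
num_epochs = 10
total_iterations = num_epochs * iters_per_epoch            # total parameter updates
print(iters_per_epoch, total_iterations)
```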
10. Online learning—relation to SGD. (Medium)
Answer: Data arrives as a stream; each example (or small group) triggers an update, a natural fit for stochastic-style updates. Same GD template, possibly with a non-stationary data distribution.
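A streaming sketch in which each arriving example triggers one update (the data generator and constants are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
w, eta = np.zeros(3), 0.01

def stream():                                  # hypothetical data stream
    while True:
        x = rng.normal(size=3)
        yield x, x @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal()

for t, (x, y) in zip(range(5000), stream()):
    w -= eta * (x @ w - y) * x                 # one example, one update (same GD template)

print(w)                                       # approaches [1, -1, 2]
```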
11. What is a plateau in optimization? (Easy)
Answer: A nearly flat region where gradients are tiny—progress slows. Learning-rate warmup, restarts, or second-order hints (in advanced optimizers) can help.
12. Can gradient noise help generalization? (Hard)
Answer: Small-batch stochasticity can help escape sharp minima and explore the landscape—linked to flat minima hypotheses. Not a guarantee; interaction with batch norm and scale matters.
13. Line search vs fixed learning rate. (Medium)
Answer: Line search picks step size along the descent direction by evaluating the objective—common in classical optimization, rare in large deep learning (too expensive); deep learning favors hand-tuned or scheduled η.
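A sketch of backtracking (Armijo) line search; the objective, gradient, and constants are assumptions for illustration, not a production routine:

```python
import numpy as np

def backtracking_step(f, grad_f, theta, eta0=1.0, shrink=0.5, c=1e-4):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    g = grad_f(theta)
    eta = eta0
    while f(theta - eta * g) > f(theta) - c * eta * (g @ g):
        eta *= shrink
    return theta - eta * g

# Toy usage on f(theta) = 0.5 * ||theta||^2
f = lambda t: 0.5 * (t @ t)
grad_f = lambda t: t
theta = np.array([5.0, -3.0])
for _ in range(20):
    theta = backtracking_step(f, grad_f, theta)
print(theta)   # near the origin
```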
14. What is preconditioning at a high level? (Hard)
Answer: Rescaling coordinates so the landscape is more isotropic—Newton-like methods use inverse Hessian; Adam/RMSprop use diagonal adaptive scaling as a cheap approximation.
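A sketch of diagonal, RMSprop-style preconditioning on an ill-conditioned toy quadratic (the constants are assumptions, not any library's defaults):

```python
import numpy as np

def grad_L(theta):                               # toy ill-conditioned quadratic
    return np.array([100.0, 1.0]) * theta        # curvatures differ by 100x

theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)                         # running mean of squared gradients
eta, rho, eps = 0.01, 0.9, 1e-8

for _ in range(500):
    g = grad_L(theta)
    s = rho * s + (1 - rho) * g * g              # per-coordinate scale estimate
    theta -= eta * g / (np.sqrt(s) + eps)        # preconditioned step

print(theta)   # both coordinates shrink at comparable rates despite the 100x gap
```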
15. GD update vs what backprop computes. (Easy)
Answer: Backpropagation computes ∇L; gradient descent uses that vector to update weights. They answer different questions: "which direction?" vs "how to step?"
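A toy split of responsibilities on a one-layer linear model (everything here is a made-up example): the chain rule produces ∇L, and the GD step then consumes it.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y_true = rng.normal(size=3), 1.5     # one toy input/target pair
W = rng.normal(size=3)                  # weights of a linear model

# Backprop's job: the chain rule yields dL/dW for L = 0.5 * (W.x - y)^2
y_pred = W @ x
dL_dy = y_pred - y_true                 # dL/dy_pred
grad_W = dL_dy * x                      # dL/dW

# Gradient descent's job: consume that vector to update the weights
eta = 0.1
W = W - eta * grad_W
```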
Keep GD answers separate from chain rule mechanics—interviewers often ask both in sequence.
Quick review checklist
- Update rule; batch vs mini-batch; learning rate effects.
- Local minima vs saddles; momentum intuition; batch size vs noise.
- GD vs backprop roles; epoch vs iteration.