Backpropagation — 15 Interview Questions
Chain rule from loss to weights, reverse-mode efficiency, Jacobian–vector products, and why training memory grows with depth.
Topics: Chain rule · Reverse mode · Activations · Gradients
1. What is backpropagation? (Easy)
Answer: The standard method to compute ∂L/∂θ for all parameters by applying the chain rule backward from the loss through each layer. It is reverse-mode automatic differentiation on the network graph.
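A minimal PyTorch sketch of the idea, using a toy scalar loss and made-up numbers:

```python
import torch

# Toy scalar loss L(w) = (3w - 6)^2, with w treated as a trainable parameter.
w = torch.tensor(1.0, requires_grad=True)
loss = (3 * w - 6) ** 2

# Reverse-mode autodiff: one backward pass fills in dL/dw.
loss.backward()
print(w.grad)  # analytically dL/dw = 2*(3w - 6)*3 = -18.0
```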
2. State the chain rule for nested functions. (Easy)
Answer: If y = f(g(x)), then dy/dx = (df/dg)(dg/dx). In many dimensions, derivatives become Jacobians and products become appropriate matrix-vector multiplies.
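A quick numeric sanity check of the one-dimensional statement; f and g below are arbitrary functions chosen only for illustration:

```python
import math

# y = f(g(x)) with g(x) = x**2 and f(u) = sin(u)
g  = lambda x: x ** 2
f  = lambda u: math.sin(u)
dg = lambda x: 2 * x            # dg/dx
df = lambda u: math.cos(u)      # df/du

x = 0.7
analytic = df(g(x)) * dg(x)     # chain rule: (df/dg)(dg/dx)

# Finite-difference check of dy/dx
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(analytic, numeric)        # should agree to ~6 decimal places
```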
3. Why reverse mode instead of forward mode for NNs? (Medium)
Answer: The loss is a scalar and we need one gradient vector w.r.t. millions of parameters. Reverse mode yields all partials at roughly the cost of one forward plus one backward pass; forward mode would need a separate pass per parameter (input direction).
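A rough sketch of the cost argument using torch.autograd.functional.jvp; the sizes are made up, and only the first three forward-mode passes are shown:

```python
import torch
from torch.autograd.functional import jvp

# Toy scalar loss of an n-dim parameter vector; n stands in for "millions of weights".
n = 1000
w = torch.randn(n, requires_grad=True)
loss_fn = lambda w: (w ** 2).sum()

# Reverse mode: ONE backward pass yields the whole gradient vector.
loss_fn(w).backward()
grad_reverse = w.grad.clone()

# Forward mode: each pass gives a single directional derivative, so recovering the
# full gradient needs n passes (one per basis direction) -- shown here for 3 of them.
basis = torch.eye(n)
grad_forward_first3 = torch.stack([jvp(loss_fn, w, basis[i])[1] for i in range(3)])
print(torch.allclose(grad_reverse[:3], grad_forward_first3))  # True
```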
4. Order of operations in the backward pass. (Medium)
Answer: Traverse the graph from the output toward the input, propagating the adjoint (the gradient of the loss w.r.t. each node's output). At each node, multiply the incoming upstream gradient by the node's local Jacobian to get the gradients for its inputs and parameters.
5. Backward through z = Wx + b: gradients w.r.t. W, x, b. (Medium)
Answer: Convention matters. For z = Wx + b with a column vector x and upstream g = ∂L/∂z: ∂L/∂b = g (summed over the batch), ∂L/∂W = g xᵀ, ∂L/∂x = Wᵀ g. In the row-major batch convention z = xW + b, these become ∂L/∂W = xᵀ g and ∂L/∂x = g Wᵀ. Interviewers check that you know the dimensions line up.
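A shape-checking sketch in NumPy under the row-major batch convention Z = XW + b; all sizes are arbitrary:

```python
import numpy as np

# Row-major batch convention: Z = X @ W + b
B, d_in, d_out = 4, 3, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(B, d_in))
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=(d_out,))

Z = X @ W + b                      # forward
G = rng.normal(size=(B, d_out))    # pretend upstream gradient dL/dZ

dW = X.T @ G                       # (d_in, d_out), matches W
db = G.sum(axis=0)                 # (d_out,),      matches b
dX = G @ W.T                       # (B, d_in),     matches X

assert dW.shape == W.shape and db.shape == b.shape and dX.shape == X.shape
```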
6. Backward through ReLU. (Easy)
Answer: Pass gradient through if pre-activation > 0, else zero. At exactly zero, use convention 0 or 1 (subgradient). Elementwise mask.
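A minimal NumPy sketch of the mask:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(upstream, x):
    # Elementwise mask: pass the gradient only where the pre-activation was positive.
    return upstream * (x > 0)

x = np.array([-1.5, 0.0, 2.0])
g = np.array([10.0, 10.0, 10.0])
print(relu_backward(g, x))  # [ 0.  0. 10.] (using the 0-at-zero subgradient convention)
```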
7. Jacobian–vector product (JVP) vs vector–Jacobian product (VJP). (Hard)
Answer: A JVP pushes a perturbation direction forward through the network (forward mode). A VJP pulls the loss gradient backward, which is what each layer implements in backprop. For a scalar loss, a single VJP sweep yields the full gradient, so VJPs are the efficient choice.
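Both are exposed directly in torch.autograd.functional; a small sketch with a made-up function f: R³ → R²:

```python
import torch
from torch.autograd.functional import jvp, vjp

# Toy vector-valued function f: R^3 -> R^2 (made up just for the shapes).
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])

# JVP: push an input-space direction forward (forward mode); the result lives in output space.
_, jvp_out = jvp(f, x, torch.tensor([1.0, 0.0, 0.0]))

# VJP: pull an output-space covector backward (what backprop does); the result lives in input space.
_, vjp_out = vjp(f, x, torch.tensor([1.0, 0.0]))

print(jvp_out.shape, vjp_out.shape)  # torch.Size([2]) torch.Size([3])
```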
8. Why does backprop need memory? (Medium)
Answer: To compute local derivatives at each layer you need forward activations (and sometimes intermediate tensors). Memory scales with network width, depth, and batch size.
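A back-of-envelope sketch of that scaling; the sizes below are hypothetical, and only the saved activations of a plain MLP are counted:

```python
# Activation memory for a plain MLP in fp32 (4 bytes per element).
batch, width, depth = 256, 4096, 48
bytes_per_float = 4

activations = batch * width * depth * bytes_per_float
print(f"~{activations / 2**20:.0f} MiB of saved activations")  # ~192 MiB
```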
9. Gradient checkpointing: trade-off? (Hard)
Answer: Don’t store every activation; recompute some during backward. Less memory, more compute—used for large models (Transformers).
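A usage sketch with torch.utils.checkpoint, assuming a recent PyTorch (use_reentrant=False selects the newer non-reentrant path); the block and sizes are made up:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical block: activations inside it are not stored; they are recomputed
# during backward in exchange for extra forward compute.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
)

x = torch.randn(32, 512, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
loss = y.sum()
loss.backward()
print(x.grad.shape)  # torch.Size([32, 512])
```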
11. How does backprop relate to vanishing gradients? (Medium)
Answer: Backprop multiplies Jacobians layer by layer; if many factors have magnitude < 1 (saturating activations), the signal shrinks toward the early layers. The math is the same; the fix is architectural (ReLU, residual connections, gating).
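A one-dimensional caricature of the effect, treating each layer's Jacobian as a single scalar factor:

```python
# Product of per-layer "Jacobian scales" as a 1-D caricature of vanishing gradients.
# Each factor stands in for the typical slope of a saturating activation (< 1).
signal = 1.0
per_layer_factor = 0.25   # a sigmoid derivative is at most 0.25
for layer in range(20):
    signal *= per_layer_factor
print(signal)  # 0.25**20 ≈ 9.1e-13: the early layers see almost no gradient
```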
12. Relationship to the computational graph. (Easy)
Answer: The network is a DAG of ops; forward evaluates nodes, backward applies chain rule along edges. Frameworks build this graph dynamically (eager with autograd) or statically.
13. Does standard training use second derivatives? (Medium)
Answer: SGD with backprop uses only first-order gradients. Second-order (Hessian-based) methods exist but are expensive; approximations such as K-FAC see limited use.
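When second-order information is needed, the usual trick is a Hessian-vector product via double backward rather than forming the Hessian; a toy sketch:

```python
import torch

# Hessian-vector product through double backward: roughly one extra backward pass,
# no explicit Hessian. Toy loss chosen so the answer is easy to verify.
w = torch.randn(5, requires_grad=True)
loss = (w ** 3).sum()                                      # Hessian is diag(6w)

(g,) = torch.autograd.grad(loss, w, create_graph=True)     # first-order gradient
v = torch.randn(5)
(hv,) = torch.autograd.grad(g @ v, w)                      # Hessian-vector product
print(torch.allclose(hv, 6 * w.detach() * v))              # True
```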
14. loss.backward() in PyTorch: what happens? (Easy)
Answer: It traverses the autograd graph from the loss, accumulating .grad on leaf tensors that require gradients. Call zero_grad() between iterations unless you intend gradients to accumulate.
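A minimal loop illustrating the interaction; the model, data, and sizes are toy placeholders:

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for step in range(5):
    opt.zero_grad()                                        # clear .grad from the last step
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                        # populate .grad on parameters
    opt.step()                                             # apply the update
```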
15. Time complexity of forward vs backward (typical claim). (Medium)
Answer: For many networks, backward is roughly 2× the multiply-add cost of forward (same order—rule of thumb). Constant factors depend on fusion and framework.
Practice one small two-layer network by hand once—it locks in the chain rule story.
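A hand-rolled sketch of exactly that exercise in NumPy (Linear → ReLU → Linear → MSE), with a finite-difference check on one weight; all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(8, 4)), rng.normal(size=(8, 1))
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

def forward(W1):
    H_pre = X @ W1 + b1
    H = np.maximum(H_pre, 0.0)
    out = H @ W2 + b2
    loss = ((out - Y) ** 2).mean()
    return loss, (H_pre, H, out)

loss, (H_pre, H, out) = forward(W1)

# Backward pass: chain rule layer by layer, output toward input.
dout = 2 * (out - Y) / Y.size        # dL/dout for the mean-squared error
dW2 = H.T @ dout
dH = dout @ W2.T
dH_pre = dH * (H_pre > 0)            # ReLU mask
dW1 = X.T @ dH_pre

# Finite-difference check on a single entry of W1.
eps, i, j = 1e-5, 0, 0
W1p = W1.copy(); W1p[i, j] += eps
W1m = W1.copy(); W1m[i, j] -= eps
numeric = (forward(W1p)[0] - forward(W1m)[0]) / (2 * eps)
print(dW1[i, j], numeric)            # should agree to several decimal places
```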
Quick review checklist
- Define backprop; chain rule; why reverse mode.
- Linear + ReLU local grads; memory and checkpointing.
- Vanishing gradients as product of Jacobians; autograd API basics.