Backpropagation 20 Essential Q/A
DL Interview Prep

Backpropagation: 20 Interview Questions & Intuition

Master chain rule, computational graphs, gradient flow, vanishing/exploding gradients, autograd, Jacobians, and weight updates. Concise, interview-ready answers with formulas.

Chain Rule Vanishing Gradient Gradient Descent Computational Graph Jacobian Autograd
1 What is backpropagation? Explain the intuition. ⚡ Easy
Answer: Backpropagation computes the gradient of the loss function with respect to every weight using the chain rule. It propagates error signals backward from output to input. Intuition: each neuron's contribution to the final error is measured, then weights are adjusted to reduce loss.
∂L/∂w = ∂L/∂out · ∂out/∂z · ∂z/∂w (chain rule)
2 How does chain rule work in backpropagation? 📊 Medium
Answer: Chain rule multiplies local gradients along the path from loss to weight. For a composition f(g(x)), derivative = f'(g(x))·g'(x). In neural nets, gradients are multiplied backward layer by layer.
# Example: z = wx + b, a = σ(z), L = (a-y)²
dL/da = 2(a-y); da/dz = σ'(z); dz/dw = x → dL/dw = dL/da * da/dz * dz/dw
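The factorization above can be checked numerically. A runnable sketch in plain Python (illustrative values; the finite difference is only a cross-check, not part of backprop itself):

```python
import math

# Forward pass: z = w*x + b, a = sigmoid(z), L = (a - y)^2
w, b, x, y = 0.5, 0.1, 2.0, 1.0
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))

# Local gradients multiplied per the chain rule
dL_da = 2 * (a - y)
da_dz = a * (1 - a)          # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

# Cross-check with a central finite difference
eps = 1e-6
def loss(w_):
    a_ = 1.0 / (1.0 + math.exp(-(w_ * x + b)))
    return (a_ - y) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(abs(dL_dw - numeric) < 1e-8)  # True: analytic matches numeric
```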
3 Difference between forward pass and backward pass? ⚡ Easy
Answer: Forward pass computes predictions and caches intermediate activations. Backward pass computes gradients using cached values and chain rule. Forward is inference; backward is learning.
4 What causes vanishing gradient in backpropagation? 📊 Medium
Answer: Saturating activations (sigmoid, tanh) have local derivatives below 1 (sigmoid's peaks at 0.25). The chain rule multiplies one such factor per layer, so gradients shrink exponentially toward the early layers; the deeper the network, the more these multiplications compound.
Fix: ReLU, residual connections, batch norm
Sigmoid/tanh in hidden layers
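The exponential shrinkage is easy to see with numbers: even in the best case, a sigmoid layer scales the gradient by at most 0.25, so 20 stacked sigmoid layers attenuate it by roughly 10¹²:

```python
# Sigmoid'(z) peaks at 0.25 (at z = 0); even in this best case the gradient
# signal shrinks by at least 4x per sigmoid layer in a deep chain.
depth = 20
grad = 1.0
for _ in range(depth):
    grad *= 0.25  # best-case local sigmoid gradient per layer
print(grad)  # 0.25**20 ~ 9.1e-13: effectively zero for the earliest layers
```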
5 Explain exploding gradient. How to mitigate? 📊 Medium
Answer: Gradients grow exponentially when repeated multiplications involve weights much larger than 1 or poor initialization, causing unstable updates (divergence, NaNs). Solutions: gradient clipping, weight regularization, careful initialization (Xavier/He).
Clip: if ||g|| > threshold, g = threshold * g / ||g||
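The clipping rule above, sketched as a small helper in plain Python (list-based for illustration; frameworks apply the same rescaling to tensors):

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale the gradient vector so its L2 norm never exceeds threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [threshold * g / norm for g in grad]
    return grad

g = [3.0, 4.0]                 # ||g|| = 5
clipped = clip_by_norm(g, 1.0)
print(clipped)                 # [0.6, 0.8]: direction kept, norm capped at 1
```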
6 What is a computational graph? How is it used in backprop? 🔥 Hard
Answer: A directed acyclic graph whose nodes are operations/variables and whose edges encode dependencies. Backprop traverses the graph in reverse topological order, multiplying local gradients via the chain rule. Frameworks (TF, PyTorch) build autograd on this.
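A minimal scalar autograd sketch makes this concrete (a hypothetical `Value` class, not any framework's API): each operation records its parents and a local backward rule, and `backward()` walks the graph in reverse topological order, accumulating gradients.

```python
class Value:
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad               # d(a+b)/da = 1
            other.grad += out.grad              # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Build reverse topological order: parents appear before children
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
L = a * b + a          # dL/da = b + 1 = 4, dL/db = a = 2
L.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Note that `+=` in the backward rules implements the sum over parallel paths: `a` feeds `L` both through `a*b` and directly, and its gradient accumulates both contributions.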
7 How does automatic differentiation (autograd) relate to backprop? 🔥 Hard
Answer: Backprop is a special case of reverse-mode autodiff. It efficiently computes gradients of scalar loss w.r.t many parameters in one forward+backward pass. Autograd builds the graph dynamically (PyTorch) or statically (TF1).
8 Why does backprop multiply gradients? Why not add? 🔥 Hard
Answer: Chain rule for composite functions is multiplicative. Each layer's effect compounds; if one layer has zero gradient, whole branch dies. Multiplication reflects dependency strength. Addition would be for parallel paths (e.g., skip connections).
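Multiply-along-a-path, add-across-paths can be verified on a toy function with two parallel paths from x to the loss (illustrative functions, checked against a finite difference):

```python
# Two parallel paths from x to L: L = f(x) + g(x), f(x) = 2x, g(x) = x^2.
# Serial composition multiplies local gradients; parallel paths add them.
x = 3.0
dL_dx = 2.0 + 2 * x   # df/dx + dg/dx = 2 + 6 = 8

# Finite-difference cross-check
eps = 1e-6
L = lambda x_: 2 * x_ + x_ ** 2
numeric = (L(x + eps) - L(x - eps)) / (2 * eps)
print(abs(dL_dx - numeric) < 1e-6)  # True
```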
9 Why can't we initialize all weights to zero? Role of backprop? 📊 Medium
Answer: Zero init makes neurons symmetric – same gradient, same updates, no feature diversity. Backprop would compute identical gradients for all neurons in a layer, preventing learning. Random init breaks symmetry.
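The symmetry argument can be demonstrated with a two-unit toy network (hypothetical setup: two sigmoid hidden units, unit output weights, squared-error loss):

```python
import math

# Two hidden sigmoid units with identical (zero) weights receive identical
# gradients under backprop, so their updates are identical forever.
def forward_and_grads(w1, w2, x=1.0, y=0.0):
    sigmoid = lambda z: 1 / (1 + math.exp(-z))
    h1, h2 = sigmoid(w1 * x), sigmoid(w2 * x)
    out = h1 + h2                       # output weights also equal (both 1)
    dL_dout = 2 * (out - y)             # L = (out - y)^2
    g1 = dL_dout * h1 * (1 - h1) * x    # dL/dw1
    g2 = dL_dout * h2 * (1 - h2) * x    # dL/dw2
    return g1, g2

g1, g2 = forward_and_grads(0.0, 0.0)
print(g1 == g2)  # True: identical nonzero gradients -> symmetry never broken
```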
10 What is gradient checking? How to implement? 🔥 Hard
Answer: Numerically approximate the gradient as (L(θ+ε)-L(θ-ε))/(2ε) and compare it with the analytical backprop gradient. Used only for debugging; it is far too expensive to run during training.
eps = 1e-7; numeric_grad = (loss(w+eps) - loss(w-eps)) / (2*eps)
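A runnable version of the check, using a relative-error criterion (the usual practice, since absolute error scales with gradient magnitude; the loss here is a toy example):

```python
# Gradient check: compare the analytic derivative of L(w) = (w - 3)^2
# against a central finite difference at several points.
eps = 1e-7
loss = lambda w: (w - 3.0) ** 2
analytic = lambda w: 2 * (w - 3.0)

for w in [-1.0, 0.5, 2.0, 10.0]:
    numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    rel_err = abs(analytic(w) - numeric) / max(abs(analytic(w)), abs(numeric), 1e-12)
    assert rel_err < 1e-6, (w, rel_err)
print("gradient check passed")
```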
11 How does backprop work through max pooling? 🔥 Hard
Answer: Gradient only passes to the neuron that achieved the max (argmax). Others get zero gradient. It's like a switch: route error to winner.
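The "switch" behavior in one line of logic (a 1-D pooling window for illustration):

```python
# Backward through max pooling: the upstream gradient is routed entirely to
# the input that won the max; every other input receives zero gradient.
def maxpool_backward(inputs, upstream_grad):
    winner = max(range(len(inputs)), key=lambda i: inputs[i])
    return [upstream_grad if i == winner else 0.0 for i in range(len(inputs))]

print(maxpool_backward([0.2, 0.9, 0.5, 0.1], upstream_grad=1.5))
# [0.0, 1.5, 0.0, 0.0]
```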
12 Derive gradient of softmax + cross-entropy loss. 🔥 Hard
Answer: Combined gradient simplifies to (p - y) where p is softmax output, y is one-hot target. Very elegant and numerically stable.
∂L/∂z_i = p_i - y_i
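The identity ∂L/∂z_i = p_i − y_i can be verified numerically (plain-Python softmax with the standard max-subtraction trick for stability; logits are illustrative):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]   # subtract max for numerical stability
    s = sum(e)
    return [v / s for v in e]

def loss(z, y):
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))  # cross-entropy

z, y = [2.0, 1.0, 0.1], [0.0, 1.0, 0.0]
p = softmax(z)
analytic = [pi - yi for pi, yi in zip(p, y)]

# Finite-difference check on each logit
eps = 1e-6
for i in range(len(z)):
    zp, zm = list(z), list(z)
    zp[i] += eps; zm[i] -= eps
    numeric = (loss(zp, y) - loss(zm, y)) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-6
print("softmax + cross-entropy gradient is p - y")
```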
13 Why are in-place operations (e.g., .relu_()) problematic for backprop? 📊 Medium
Answer: Backprop requires intermediate activations (e.g., the input to ReLU) to compute gradients. In-place ops overwrite those values and break the graph; PyTorch tracks tensor versions and raises a runtime error if a tensor needed for backward was modified in place.
14 Can backprop compute second-order gradients? How? 🔥 Hard
Answer: Yes, via automatic differentiation on the gradient graph (e.g., PyTorch `torch.autograd.grad`). Used in meta-learning, Hessian-free optimization, etc.
15 Is backprop the same as reverse-mode autodiff? 📊 Medium
Answer: Backprop is the algorithm applied to neural nets; reverse-mode autodiff is the general technique. Backprop = reverse-mode AD applied to a scalar loss with caching.
16 Role of Jacobian matrix in backpropagation? 🔥 Hard
Answer: For vector functions, the local gradient is a Jacobian matrix (∂output/∂input). Backprop multiplies Jacobians along the path. In practice, frameworks use vector-Jacobian products (VJPs) for efficiency.
v^T · J (VJP) instead of full J
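A small VJP sketch for a linear map (for y = A x the Jacobian is A itself; v plays the role of the upstream gradient ∂L/∂y, and the values are illustrative):

```python
# Vector-Jacobian product: backprop only ever needs v^T J, never the full
# Jacobian materialized as a separate object.
def vjp_linear(A, v):
    """Return v^T A as a list: the gradient dL/dx for y = A x."""
    rows, cols = len(A), len(A[0])
    return [sum(v[r] * A[r][c] for r in range(rows)) for c in range(cols)]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]      # Jacobian of y = A x (3 outputs, 2 inputs)
v = [1.0, 0.0, -1.0]  # upstream gradient dL/dy
print(vjp_linear(A, v))  # [-4.0, -4.0]
```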
17 Define the error signal (δ) in backprop. 📊 Medium
Answer: δ_i^l = ∂L / ∂z_i^l (pre-activation at layer l). It represents how much the total loss changes when the pre-activation changes. Propagated backward: δ^l = (W^{l+1})^T δ^{l+1} ⊙ σ'(z^l).
18 What is Backpropagation Through Time (BPTT)? 📊 Medium
Answer: BPTT unfolds RNN through time steps, treats it as a deep network with shared weights. Gradients sum over time. Suffers from vanishing/exploding due to repeated multiplications. Truncated BPTT limits steps.
19 Why did greedy layerwise pretraining help backprop in early deep learning? 🔥 Hard
Answer: Initialized weights in a sensible region, avoiding vanishing gradients. Backprop then fine-tuned. Modern techniques (ReLU, batch norm, good init) made pretraining less critical.
20 How do skip connections (ResNet) help backpropagation? 📊 Medium
Answer: Skip connections create an alternative gradient highway – identity mapping. Gradient can flow directly through skip path, mitigating vanishing gradient and enabling very deep networks (>100 layers).
Gradient shortcut: y = x + F(x) → ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x); the identity term keeps the gradient alive
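The identity path's effect is visible even when the residual branch is nearly dead (a deliberately weak F(x), checked with a finite difference):

```python
# Residual block y = x + F(x): dy/dx = 1 + F'(x), so even if F'(x) is tiny,
# the identity path still carries the upstream gradient through intact.
x = 2.0
F = lambda x_: 0.001 * x_ ** 2      # a "weak" residual branch, F'(x) small
y = lambda x_: x_ + F(x_)

eps = 1e-6
dy_dx = (y(x + eps) - y(x - eps)) / (2 * eps)
print(round(dy_dx, 6))  # ~1.004: the identity path contributes the full 1.0
```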

Backpropagation – Interview Cheat Sheet

Core concepts
  • Chain rule: multiplicative gradients
  • Vanishing: sigmoid/tanh, deep nets
  • Exploding: large weights; fix with clipping
  • Autograd: reverse-mode AD
Debugging
  • Gradient checking, monitor gradient norms
Modern improvements
  • ResNet: skip connections
  • Batch Norm: smoother gradients
  • ReLU: non-saturating
  • Optimizers: Adam, SGD with momentum
Framework
  • PyTorch: `loss.backward()`; TF: `tf.GradientTape`

⚡ Key rule: "Backprop = chain rule + caching + gradient descent."