Backpropagation: 20 Interview Questions & Intuition
Master chain rule, computational graphs, gradient flow, vanishing/exploding gradients, autograd, Jacobians, and weight updates. Concise, interview-ready answers with formulas.
Topics: Chain Rule · Vanishing Gradient · Gradient Descent · Computational Graph · Jacobian · Autograd
1
What is backpropagation? Explain the intuition.
⚡ Easy
Answer: Backpropagation computes the gradient of the loss function with respect to every weight using the chain rule. It propagates error signals backward from output to input. Intuition: each neuron's contribution to the final error is measured, then weights are adjusted to reduce loss.
∂L/∂w = ∂L/∂out · ∂out/∂z · ∂z/∂w (chain rule)
2
How does chain rule work in backpropagation?
📊 Medium
Answer: Chain rule multiplies local gradients along the path from loss to weight. For a composition f(g(x)), derivative = f'(g(x))·g'(x). In neural nets, gradients are multiplied backward layer by layer.
# Example: z = wx + b, a = σ(z), L = (a-y)²
dL/da = 2(a-y); da/dz = σ'(z); dz/dw = x → dL/dw = dL/da * da/dz * dz/dw
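A quick sketch of this derivation in plain Python (toy values chosen for illustration), checking the hand-derived chain-rule gradient against a central difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, b, y):
    # forward: z = wx + b, a = sigma(z), L = (a - y)^2
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def grad_w(w, x, b, y):
    # backward: multiply the three local gradients from the chain rule
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2 * (a - y)
    da_dz = a * (1 - a)   # sigma'(z) = sigma(z) * (1 - sigma(z))
    dz_dw = x
    return dL_da * da_dz * dz_dw

w, x, b, y = 0.5, 1.5, -0.3, 1.0
eps = 1e-6
numeric = (loss(w + eps, x, b, y) - loss(w - eps, x, b, y)) / (2 * eps)
analytic = grad_w(w, x, b, y)
assert abs(numeric - analytic) < 1e-6
```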
3
Difference between forward pass and backward pass?
⚡ Easy
Answer: Forward pass computes predictions and caches intermediate activations. Backward pass computes gradients using cached values and chain rule. Forward is inference; backward is learning.
4
What causes vanishing gradient in backpropagation?
📊 Medium
Answer: Gradients of saturating activations (sigmoid, tanh) have magnitude < 1 (sigmoid's derivative peaks at 0.25); backprop multiplies one such factor per layer, so early-layer gradients shrink exponentially with depth.
Cause: sigmoid/tanh in hidden layers, deep stacks
Fix: ReLU, residual connections, batch norm
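A minimal sketch of the effect (plain Python; assumes unit weights and pre-activations at zero, where sigmoid's derivative is largest):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_magnitude(num_layers, z=0.0, w=1.0):
    # one local gradient factor per layer: w * sigma'(z), with sigma'(z) <= 0.25
    sp = sigmoid(z) * (1 - sigmoid(z))
    return (w * sp) ** num_layers

shallow = gradient_magnitude(2)   # 0.25^2  = 0.0625
deep = gradient_magnitude(30)     # 0.25^30 ~ 8.7e-19: effectively zero
assert deep < 1e-15 < shallow
```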
5
Explain exploding gradient. How to mitigate?
📊 Medium
Answer: Gradients grow exponentially when weight matrices have norms > 1 or initialization is poor, causing unstable updates (divergence, NaNs). Solutions: gradient clipping, weight regularization, careful initialization (Xavier/He).
Clip: if ||g|| > threshold, g = threshold * g / ||g||
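The clipping rule above as a small plain-Python sketch (gradient as a flat list; frameworks provide this built in, e.g. PyTorch's `clip_grad_norm_`):

```python
import math

def clip_by_norm(grad, threshold):
    # if ||g|| > threshold, rescale g to have norm exactly threshold
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return grad

g = [3.0, 4.0]                   # ||g|| = 5
clipped = clip_by_norm(g, 1.0)   # rescaled down to unit norm
clipped_norm = math.sqrt(sum(c * c for c in clipped))
assert abs(clipped_norm - 1.0) < 1e-12
```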
6
What is a computational graph? How is it used in backprop?
🔥 Hard
Answer: A directed acyclic graph where nodes are operations/variables and edges define dependencies. Backprop traverses the graph in reverse topological order, multiplying gradients via the chain rule. Frameworks (TensorFlow, PyTorch) build autograd on this.
7
How does automatic differentiation (autograd) relate to backprop?
🔥 Hard
Answer: Backprop is a special case of reverse-mode autodiff. It efficiently computes gradients of scalar loss w.r.t many parameters in one forward+backward pass. Autograd builds the graph dynamically (PyTorch) or statically (TF1).
8
Why does backprop multiply gradients? Why not add?
🔥 Hard
Answer: The chain rule for composed functions is multiplicative: serial dependencies compound, so if one layer's local gradient is zero, the whole path's gradient dies. Gradients are added only across parallel paths into the same variable (e.g., skip connections).
9
Why can't we initialize all weights to zero? Role of backprop?
📊 Medium
Answer: Zero initialization makes all neurons in a layer symmetric: backprop computes identical gradients, so they receive identical updates and never learn diverse features. Random initialization breaks the symmetry.
10
What is gradient checking? How to implement?
🔥 Hard
Answer: Numerically approximate the gradient as (L(θ+ε) − L(θ−ε)) / (2ε) and compare it with the analytical backprop gradient. Too expensive to run during training; use it only for debugging.
eps = 1e-7; numeric_grad = (loss(w+eps) - loss(w-eps)) / (2*eps)
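A runnable version of that one-liner (plain Python; the quadratic loss and values are toy assumptions), comparing via relative error as is standard for gradient checks:

```python
def numerical_grad(f, w, eps=1e-7):
    # central difference: (f(w+eps) - f(w-eps)) / (2*eps)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def relative_error(a, b):
    return abs(a - b) / max(abs(a), abs(b), 1e-12)

# toy loss L(w) = (w*x - y)^2 with analytic gradient 2*(w*x - y)*x
x, y, w = 2.0, 1.0, 0.3
loss = lambda w_: (w_ * x - y) ** 2
analytic = 2 * (w * x - y) * x
numeric = numerical_grad(loss, w)
assert relative_error(analytic, numeric) < 1e-7
```

A relative error below ~1e-7 with double precision suggests the analytic gradient is correct; ~1e-2 or worse usually means a backprop bug.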
11
How does backprop work through max pooling?
🔥 Hard
Answer: Gradient only passes to the neuron that achieved the max (argmax). Others get zero gradient. It's like a switch: route error to winner.
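The switch behavior as a minimal sketch (plain Python, 1-D pool over a small toy window):

```python
def maxpool_forward(x):
    # returns the max and its index (the index is cached for backward)
    idx = max(range(len(x)), key=lambda i: x[i])
    return x[idx], idx

def maxpool_backward(grad_out, idx, size):
    # route the incoming gradient to the argmax; all other inputs get zero
    grad_in = [0.0] * size
    grad_in[idx] = grad_out
    return grad_in

x = [0.2, 1.7, -0.5, 0.9]
out, idx = maxpool_forward(x)
grad = maxpool_backward(1.0, idx, len(x))
assert out == 1.7 and grad == [0.0, 1.0, 0.0, 0.0]
```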
12
Derive gradient of softmax + cross-entropy loss.
🔥 Hard
Answer: Combined gradient simplifies to (p - y) where p is softmax output, y is one-hot target. Very elegant and numerically stable.
∂L/∂z_i = p_i - y_i
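A sketch verifying p − y numerically (plain Python, toy logits; uses the max-shift trick for stable softmax):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(z, target):
    return -math.log(softmax(z)[target])

z, target = [2.0, 1.0, 0.1], 0
p = softmax(z)
# analytic gradient of the combined loss: dL/dz_i = p_i - y_i
analytic = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

# check each component against a central difference
eps = 1e-6
for i in range(len(z)):
    zp, zm = z[:], z[:]
    zp[i] += eps
    zm[i] -= eps
    numeric = (cross_entropy(zp, target) - cross_entropy(zm, target)) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-6
```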
13
Why are in-place operations (e.g., .relu_()) problematic for backprop?
📊 Medium
Answer: Backprop needs the intermediate activations saved during the forward pass (e.g., the input to ReLU) to compute gradients. In-place ops overwrite those buffers, corrupting the graph; PyTorch tracks tensor versions and raises an error when a needed value was modified in place.
14
Can backprop compute second-order gradients? How?
🔥 Hard
Answer: Yes, via automatic differentiation on the gradient graph (e.g., PyTorch `torch.autograd.grad`). Used in meta-learning, Hessian-free optimization, etc.
15
Is backprop the same as reverse-mode autodiff?
📊 Medium
Answer: Backprop is the algorithm applied to neural nets; reverse-mode autodiff is the general technique. Backprop = reverse-mode AD applied to a scalar loss with caching.
16
Role of Jacobian matrix in backpropagation?
🔥 Hard
Answer: For vector functions, the local gradient is a Jacobian matrix (∂output/∂input). Backprop multiplies Jacobians along the path. In practice, frameworks use vector-Jacobian products (VJPs) for efficiency.
v^T · J (VJP) instead of full J
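A small plain-Python sketch of a VJP for a linear map z = Wx (where the Jacobian ∂z/∂x is W itself); the point is that v^T W is computed in one pass without materializing J, which matters when J would be huge:

```python
def matvec(W, x):
    # forward: z = W x
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def vjp_linear(W, v):
    # backward: v^T W, computed without storing an n_out x n_in Jacobian
    n_in = len(W[0])
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(n_in)]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3x2 map; J of z = Wx is W
v = [1.0, 0.5, -1.0]                      # upstream gradient dL/dz
grad_x = vjp_linear(W, v)                 # dL/dx = v^T W
assert grad_x == [-2.5, -2.0]
```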
17
Define the error signal (δ) in backprop.
📊 Medium
Answer: δ_i^l = ∂L/∂z_i^l, the gradient w.r.t. the pre-activation at layer l. It measures how much the total loss changes when that pre-activation changes. Propagated backward: δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).
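The recursion on a tiny 1-1-1 network (plain Python, scalar weights so the transpose is trivial; toy values), checked against a central difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny net: z1 = w1*x, a1 = sigma(z1), z2 = w2*a1 (linear output), L = (z2 - y)^2
x, y, w1, w2 = 1.5, 0.7, 0.4, -0.8

z1 = w1 * x
a1 = sigmoid(z1)
z2 = w2 * a1
delta2 = 2 * (z2 - y)                   # delta^2 = dL/dz2 (linear output layer)
delta1 = w2 * delta2 * (a1 * (1 - a1))  # delta^1 = (w2)^T delta^2 * sigma'(z1)
dL_dw1 = delta1 * x                     # weight gradient: delta^1 times its input

# numerical check of the recursion
eps = 1e-6
loss = lambda w1_: (w2 * sigmoid(w1_ * x) - y) ** 2
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(dL_dw1 - numeric) < 1e-6
```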
18
What is Backpropagation Through Time (BPTT)?
📊 Medium
Answer: BPTT unfolds RNN through time steps, treats it as a deep network with shared weights. Gradients sum over time. Suffers from vanishing/exploding due to repeated multiplications. Truncated BPTT limits steps.
19
Why did greedy layerwise pretraining help backprop in early deep learning?
🔥 Hard
Answer: Initialized weights in a sensible region, avoiding vanishing gradients. Backprop then fine-tuned. Modern techniques (ReLU, batch norm, good init) made pretraining less critical.
20
How do skip connections (ResNet) help backpropagation?
📊 Medium
Answer: Skip connections create an alternative gradient highway through the identity mapping. Gradient flows directly along the skip path, mitigating vanishing gradients and enabling very deep networks (>100 layers).
Gradient shortcut (y = x + F(x)): ∂L/∂x = ∂L/∂y · (I + ∂F/∂x)
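A minimal numeric sketch (plain Python; residual branch F(x) = σ(w·x) with a large toy weight that saturates the sigmoid, so F'(x) ≈ 0):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w = 0.5, 10.0             # large w saturates the sigmoid branch
s = sigmoid(w * x)
F_prime = w * s * (1 - s)    # derivative of the residual branch, nearly 0
dL_dy = 1.0                  # upstream gradient

dL_dx_residual = dL_dy * (1 + F_prime)  # with skip: identity term keeps it alive
dL_dx_plain = dL_dy * F_prime           # without skip: nearly vanishes
assert dL_dx_plain < 0.1 < dL_dx_residual
```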
Backpropagation – Interview Cheat Sheet
Core concepts
- Chain rule: multiplicative gradients
- Vanishing: sigmoid/tanh, deep nets
- Exploding: large weights; fix with clipping
- Autograd: reverse-mode AD
Debugging
- Gradient checking, monitor gradient norms
Modern improvements
- ResNet: skip connections
- Batch Norm: smoother gradients
- ReLU: non-saturating
- Optimizers: Adam, SGD with momentum
Framework
- PyTorch: `loss.backward()`; TF: `tf.GradientTape`
⚡ Key rule: "Backprop = chain rule + caching + gradient descent."