Backpropagation: 20 Interview Questions & Intuition
Master chain rule, computational graphs, gradient flow, vanishing/exploding gradients, autograd, Jacobians, and weight updates. Concise, interview-ready answers with formulas.
Chain Rule
Vanishing Gradient
Gradient Descent
Computational Graph
Jacobian
Autograd
1
What is backpropagation? Explain the intuition.
⚡ Easy
Answer: Backpropagation computes the gradient of the loss function with respect to every weight using the chain rule. It propagates error signals backward from output to input. Intuition: each neuron's contribution to the final error is measured, then weights are adjusted to reduce loss.
∂L/∂w = ∂L/∂out · ∂out/∂z · ∂z/∂w (chain rule)
2
How does chain rule work in backpropagation?
📊 Medium
Answer: Chain rule multiplies local gradients along the path from loss to weight. For a composition f(g(x)), derivative = f'(g(x))·g'(x). In neural nets, gradients are multiplied backward layer by layer.
# Example: z = wx + b, a = σ(z), L = (a-y)²
dL/da = 2(a-y); da/dz = σ'(z); dz/dw = x → dL/dw = dL/da * da/dz * dz/dw
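A minimal NumPy sketch of this single-neuron example (the values of w, b, x, y are arbitrary, chosen only for illustration):

# z = w*x + b, a = sigmoid(z), L = (a - y)^2
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, x, y = 0.5, 0.1, 2.0, 1.0
z = w * x + b                    # forward: pre-activation
a = sigmoid(z)                   # forward: activation
L = (a - y) ** 2                 # forward: loss

dL_da = 2 * (a - y)              # local gradient ∂L/∂a
da_dz = a * (1 - a)              # σ'(z) written in terms of its output
dz_dw = x                        # z is linear in w
dL_dw = dL_da * da_dz * dz_dw    # chain rule: multiply the local gradients
print(dL_dw)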
3
Difference between forward pass and backward pass?
⚡ Easy
Answer: Forward pass computes predictions and caches intermediate activations. Backward pass computes gradients using cached values and chain rule. Forward is inference; backward is learning.
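A hedged PyTorch sketch of the two phases (layer sizes and data are arbitrary):

import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
x, target = torch.randn(16, 4), torch.randn(16, 1)

pred = model(x)                                    # forward pass: compute predictions, cache activations
loss = torch.nn.functional.mse_loss(pred, target)
loss.backward()                                    # backward pass: chain rule over the cached values
print(model[0].weight.grad.shape)                  # gradients now populate each parameter's .grad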
4
What causes vanishing gradient in backpropagation?
📊 Medium
Answer: Saturated activations (sigmoid, tanh) have gradients < 1 (sigmoid's derivative is at most 0.25); multiplying them backward through many layers makes early-layer gradients exponentially small. The deeper the network, the more multiplications and the worse the decay, as the sketch below shows.
Cause: sigmoid/tanh in hidden layers, very deep stacks
Fix: ReLU, residual connections, batch norm
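A tiny sketch of the effect: multiplying 20 sigmoid derivatives collapses the gradient; the depth and pre-activation value are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

grad, z = 1.0, 2.0                # z = 2 is a mildly saturated pre-activation
for layer in range(20):
    a = sigmoid(z)
    grad *= a * (1 - a)           # σ'(z) = σ(z)(1 - σ(z)) ≤ 0.25
print(grad)                       # ~1e-20: early layers receive almost no signal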
5
Explain exploding gradient. How to mitigate?
📊 Medium
Answer: Gradients grow exponentially when weight magnitudes (per-layer Jacobian norms) exceed 1, often from poor initialization, causing unstable or divergent updates. Solutions: gradient clipping, weight regularization, careful initialization (Xavier/He).
Clip: if ||g|| > threshold, g = threshold * g / ||g||
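A short sketch of both the manual formula and PyTorch's built-in helper (the model and threshold are illustrative):

import numpy as np
import torch

# manual norm clipping, matching the formula above
g, threshold = np.array([3.0, 4.0]), 1.0
norm = np.linalg.norm(g)
if norm > threshold:
    g = threshold * g / norm                  # rescale so that ||g|| == threshold

# equivalent utility applied to a model's gradients after backward()
model = torch.nn.Linear(10, 1)
model(torch.randn(8, 10)).sum().backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)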
6
What is a computational graph? How is it used in backprop?
🔥 Hard
Answer: A directed acyclic graph where nodes are operations/variables and edges define dependencies. Backprop traverses the graph in reverse topological order, multiplying gradients via the chain rule. Frameworks (TF, PyTorch) build autograd on this.
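A toy sketch of a scalar computational graph with a reverse-topological backward pass; the Node class and helper functions are invented for illustration and omit everything real frameworks add:

class Node:
    # minimal scalar autograd node: value, parent nodes, local backward rule, accumulated grad
    def __init__(self, value, parents=()):
        self.value, self.parents, self.backward_fn, self.grad = value, parents, None, 0.0

def mul(a, b):
    out = Node(a.value * b.value, (a, b))
    def backward_fn(g):
        a.grad += g * b.value     # ∂(ab)/∂a = b
        b.grad += g * a.value     # ∂(ab)/∂b = a
    out.backward_fn = backward_fn
    return out

def add(a, b):
    out = Node(a.value + b.value, (a, b))
    def backward_fn(g):
        a.grad += g               # ∂(a+b)/∂a = 1
        b.grad += g
    out.backward_fn = backward_fn
    return out

def backward(root):
    order, seen = [], set()
    def topo(n):                  # post-order DFS yields a topological order
        if n not in seen:
            seen.add(n)
            for p in n.parents:
                topo(p)
            order.append(n)
    topo(root)
    root.grad = 1.0
    for n in reversed(order):     # reverse topological order: consumers before producers
        if n.backward_fn:
            n.backward_fn(n.grad)

x, w = Node(2.0), Node(3.0)
loss = add(mul(x, w), w)          # loss = x*w + w
backward(loss)
print(w.grad)                     # ∂loss/∂w = x + 1 = 3.0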
7
How does automatic differentiation (autograd) relate to backprop?
🔥 Hard
Answer: Backprop is a special case of reverse-mode autodiff. It efficiently computes gradients of a scalar loss w.r.t. many parameters in one forward+backward pass. Autograd builds the graph dynamically (PyTorch) or statically (TF1).
8
Why does backprop multiply gradients? Why not add?
🔥 Hard
Answer: The chain rule for composed functions is multiplicative: each layer's effect compounds, so if any layer on a path has zero gradient, that whole path's gradient dies. Addition applies to parallel paths (e.g., skip connections), where the gradients from the branches sum.
9
Why can't we initialize all weights to zero? Role of backprop?
📊 Medium
Answer: Zero init makes neurons symmetric – same gradient, same updates, no feature diversity. Backprop would compute identical gradients for all neurons in a layer, preventing learning. Random init breaks symmetry.
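A short NumPy sketch: with constant initialization (zero is the extreme case) every hidden unit computes the same thing, so backprop returns identical gradient columns; sizes and data are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 3)), rng.normal(size=(5, 1))
W1, W2 = np.full((3, 4), 0.5), np.full((4, 1), 0.5)   # constant init (symmetric)

h = np.tanh(x @ W1)                    # every hidden column is identical
pred = h @ W2
dpred = 2 * (pred - y) / len(x)        # ∂MSE/∂pred
dh = dpred @ W2.T
dW1 = x.T @ (dh * (1 - h ** 2))        # tanh'(z) = 1 - tanh(z)^2
print(np.allclose(dW1, dW1[:, :1]))    # True: all hidden units get the same update
# with all-zero init the gradients to W1 are not just identical but exactly zero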
10
What is gradient checking? How to implement?
🔥 Hard
Answer: Numerically approximate the gradient as (L(θ+ε) − L(θ−ε)) / (2ε) and compare it with the analytical backprop gradient. Used for debugging; it is far too expensive to run during normal training.
eps = 1e-7; numeric_grad = (loss(w+eps) - loss(w-eps)) / (2*eps)
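A runnable sketch comparing the central-difference estimate with the analytic gradient for the single-neuron example above (constants are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y, b = 2.0, 1.0, 0.1
def loss(w):
    return (sigmoid(w * x + b) - y) ** 2

w, eps = 0.5, 1e-7
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)   # central difference
a = sigmoid(w * x + b)
analytic = 2 * (a - y) * a * (1 - a) * x                # backprop (chain rule) gradient
print(abs(numeric - analytic))                          # should be tiny (~1e-9 or less)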
11
How does backprop work through max pooling?
🔥 Hard
Answer: The gradient passes only to the neuron that achieved the max (the argmax); all others get zero gradient. It acts like a switch, routing the error to the winner.
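A quick PyTorch check that only the argmax position receives gradient (the input values are arbitrary):

import torch

x = torch.tensor([[1.0, 5.0, 3.0, 2.0]], requires_grad=True)
out = torch.nn.functional.max_pool1d(x.unsqueeze(0), kernel_size=4)   # single window over 4 values
out.sum().backward()
print(x.grad)    # tensor([[0., 1., 0., 0.]]): all gradient goes to the max element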
12
Derive gradient of softmax + cross-entropy loss.
🔥 Hard
Answer: Combined gradient simplifies to (p - y) where p is softmax output, y is one-hot target. Very elegant and numerically stable.
∂L/∂z_i = p_i - y_i
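A quick autograd check of ∂L/∂z = p − y on toy logits (the logits and target class are arbitrary):

import torch

z = torch.tensor([2.0, 1.0, 0.1], requires_grad=True)      # logits
target = torch.tensor(1)                                    # true class index
loss = torch.nn.functional.cross_entropy(z.unsqueeze(0), target.unsqueeze(0))
loss.backward()

p = torch.softmax(z, dim=0).detach()
y = torch.nn.functional.one_hot(target, num_classes=3).float()
print(torch.allclose(z.grad, p - y))                        # True: the combined gradient is p - y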
13
Why are in-place operations (e.g., .relu_()) problematic for backprop?
📊 Medium
Answer: Backprop needs intermediate values (e.g., the input to ReLU or the output of sigmoid) to compute gradients. In-place ops overwrite them, breaking the graph. PyTorch tracks tensor versions and raises an error when a value saved for backward has been modified; avoid in-place ops on tensors the graph still needs.
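A hedged PyTorch illustration; modifying a tensor in place after it has been saved for backward typically triggers a runtime error:

import torch

x = torch.randn(5, requires_grad=True)
a = torch.sigmoid(x)      # sigmoid's backward needs its own output
a.relu_()                 # in-place op overwrites that saved output
try:
    a.sum().backward()
except RuntimeError as err:
    print("autograd detected the in-place modification:", err)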
14
Can backprop compute second-order gradients? How?
🔥 Hard
Answer: Yes, via automatic differentiation on the gradient graph (e.g., PyTorch `torch.autograd.grad`). Used in meta-learning, Hessian-free optimization, etc.
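A minimal sketch with create_graph=True so the first gradient itself stays differentiable:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
(g,) = torch.autograd.grad(y, x, create_graph=True)    # dy/dx = 3x² = 12, still part of a graph
(g2,) = torch.autograd.grad(g, x)                       # d²y/dx² = 6x = 12
print(g.item(), g2.item())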
15
Is backprop the same as reverse-mode autodiff?
📊 Medium
Answer: Backprop is the algorithm applied to neural nets; reverse-mode autodiff is the general technique. Backprop = reverse-mode AD applied to a scalar loss with caching.
16
Role of Jacobian matrix in backpropagation?
🔥 Hard
Answer: For vector functions, the local gradient is a Jacobian matrix (∂output/∂input). Backprop multiplies Jacobians along the path. In practice, frameworks use vector-Jacobian products (VJPs) for efficiency.
v^T · J (VJP) instead of full J
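A short sketch of a VJP via grad_outputs (W and v are arbitrary):

import torch

x = torch.randn(3, requires_grad=True)
W = torch.randn(4, 3)
y = W @ x                                   # vector output; the Jacobian ∂y/∂x is W (4x3)
v = torch.randn(4)                          # the "vector" in v^T · J
(vjp,) = torch.autograd.grad(y, x, grad_outputs=v)
print(torch.allclose(vjp, v @ W))           # True: backprop formed v^T J without building J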
17
Define the error signal (δ) in backprop.
📊 Medium
Answer: δ_i^l = ∂L/∂z_i^l (gradient w.r.t. the pre-activation at layer l). It measures how much the total loss changes when that pre-activation changes. Propagated backward: δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).
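A minimal NumPy sketch of this recursion for one hidden layer (shapes and data are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

z1 = W1 @ x;  a1 = sigmoid(z1)               # hidden layer
z2 = W2 @ a1; a2 = sigmoid(z2)               # output layer, L = 0.5 * ||a2 - y||^2

delta2 = (a2 - y) * a2 * (1 - a2)            # δ^2 = ∂L/∂z^2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # δ^1 = (W^2)^T δ^2 ⊙ σ'(z^1)
dW1 = np.outer(delta1, x)                    # ∂L/∂W^1 = δ^1 x^T
print(dW1.shape)                             # (4, 3)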
18
What is Backpropagation Through Time (BPTT)?
📊 Medium
Answer: BPTT unfolds RNN through time steps, treats it as a deep network with shared weights. Gradients sum over time. Suffers from vanishing/exploding due to repeated multiplications. Truncated BPTT limits steps.
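A hedged PyTorch sketch of truncated BPTT: detaching the hidden state between chunks stops gradients from flowing past the truncation boundary (sizes, chunk length, and the loss are illustrative):

import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
h = torch.zeros(1, 2, 8)                        # (num_layers, batch, hidden)
data = torch.randn(2, 100, 4)                   # batch of 2, 100 time steps

for chunk in data.split(20, dim=1):             # unroll only 20 steps at a time
    out, h = rnn(chunk, h)
    loss = out.pow(2).mean()                    # stand-in loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                              # cut the graph at the chunk boundary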
19
Why did greedy layerwise pretraining help backprop in early deep learning?
🔥 Hard
Answer: Initialized weights in a sensible region, avoiding vanishing gradients. Backprop then fine-tuned. Modern techniques (ReLU, batch norm, good init) made pretraining less critical.
20
How do skip connections (ResNet) help backpropagation?
📊 Medium
Answer: Skip connections create an alternative gradient highway – identity mapping. Gradient can flow directly through skip path, mitigating vanishing gradient and enabling very deep networks (>100 layers).
Gradient shortcut: for y = x + F(x), ∂L/∂x = ∂L/∂y + ∂L/∂y · ∂F/∂x (the identity term lets gradient bypass F)
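A small PyTorch check that the identity path keeps gradient flowing even when the residual branch is nearly dead (the tiny weight scale is only to make the contrast visible):

import torch

x = torch.randn(8, 16, requires_grad=True)
F = torch.nn.Linear(16, 16)
with torch.no_grad():
    F.weight.mul_(1e-4)                          # nearly dead residual branch

y_plain = F(x)                                   # no skip connection
y_res = x + F(x)                                 # residual block: identity + branch

g_plain = torch.autograd.grad(y_plain.sum(), x)[0]
g_res = torch.autograd.grad(y_res.sum(), x)[0]
print(g_plain.abs().mean().item(), g_res.abs().mean().item())   # tiny vs ~1: identity path carries gradient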
Backpropagation – Interview Cheat Sheet
Core concepts
- Chain rule: multiplicative gradients
- Vanishing: sigmoid/tanh, deep nets
- Exploding: large weights; fix with clipping
- Autograd: reverse-mode AD
Debugging
- Gradient checking, monitor gradient norms
Modern improvements
- ResNet: skip connections
- Batch Norm: smoother gradients
- ReLU: non-saturating
- Optimizers: Adam, SGD with momentum
Framework
- PyTorch: `loss.backward()`; TF: `tf.GradientTape`
⚡ Key rule: "Backprop = chain rule + caching + gradient descent."