Backpropagation: 20 Interview Questions & Intuition
Master chain rule, computational graphs, gradient flow, vanishing/exploding gradients, autograd, Jacobians, and weight updates. Concise, interview-ready answers with formulas.
Topics: Chain Rule · Vanishing Gradient · Gradient Descent · Computational Graph · Jacobian · Autograd
1
What is backpropagation? Explain the intuition.
⚡ Easy
Answer: Backpropagation computes the gradient of the loss function with respect to every weight using the chain rule. It propagates error signals backward from output to input. Intuition: each neuron's contribution to the final error is measured, then weights are adjusted to reduce loss.
∂L/∂w = ∂L/∂out · ∂out/∂z · ∂z/∂w (chain rule)
2
How does chain rule work in backpropagation?
📊 Medium
Answer: Chain rule multiplies local gradients along the path from loss to weight. For a composition f(g(x)), derivative = f'(g(x))·g'(x). In neural nets, gradients are multiplied backward layer by layer.
# Example: z = wx + b, a = σ(z), L = (a-y)²
dL/da = 2(a-y); da/dz = σ'(z); dz/dw = x → dL/dw = dL/da * da/dz * dz/dw
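A quick sketch of this derivation in plain Python (toy values chosen for illustration), checking the hand-derived chain-rule gradient against a central difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, b, y):
    # forward: z = wx + b, a = sigma(z), L = (a - y)^2
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def grad_w(w, x, b, y):
    # backward: multiply the three local gradients from the chain rule
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2 * (a - y)
    da_dz = a * (1 - a)   # sigma'(z) = sigma(z) * (1 - sigma(z))
    dz_dw = x
    return dL_da * da_dz * dz_dw

w, x, b, y = 0.5, 1.5, -0.3, 1.0
eps = 1e-6
numeric = (loss(w + eps, x, b, y) - loss(w - eps, x, b, y)) / (2 * eps)
analytic = grad_w(w, x, b, y)
assert abs(numeric - analytic) < 1e-6
```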
3
Difference between forward pass and backward pass?
⚡ Easy
Answer: Forward pass computes predictions and caches intermediate activations. Backward pass computes gradients using cached values and chain rule. Forward is inference; backward is learning.
4
What causes vanishing gradient in backpropagation?
📊 Medium
Answer: Gradients of saturating activations (sigmoid, tanh) have magnitude < 1 (sigmoid's derivative peaks at 0.25); backprop multiplies one such factor per layer, so early-layer gradients shrink exponentially with depth.
Cause: sigmoid/tanh in hidden layers, deep stacks
Fix: ReLU, residual connections, batch norm
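A minimal sketch of the effect (plain Python; assumes unit weights and pre-activations at zero, where sigmoid's derivative is largest):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_magnitude(num_layers, z=0.0, w=1.0):
    # one local gradient factor per layer: w * sigma'(z), with sigma'(z) <= 0.25
    sp = sigmoid(z) * (1 - sigmoid(z))
    return (w * sp) ** num_layers

shallow = gradient_magnitude(2)   # 0.25^2  = 0.0625
deep = gradient_magnitude(30)     # 0.25^30 ~ 8.7e-19: effectively zero
assert deep < 1e-15 < shallow
```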
5
Explain exploding gradient. How to mitigate?
📊 Medium
Answer: Gradients grow exponentially when weight matrices have norms > 1 or initialization is poor, causing unstable updates (divergence, NaNs). Solutions: gradient clipping, weight regularization, careful initialization (Xavier/He).
Clip: if ||g|| > threshold, g = threshold * g / ||g||
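The clipping rule above as a small plain-Python sketch (gradient as a flat list; frameworks provide this built in, e.g. PyTorch's `clip_grad_norm_`):

```python
import math

def clip_by_norm(grad, threshold):
    # if ||g|| > threshold, rescale g to have norm exactly threshold
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return grad

g = [3.0, 4.0]                   # ||g|| = 5
clipped = clip_by_norm(g, 1.0)   # rescaled down to unit norm
clipped_norm = math.sqrt(sum(c * c for c in clipped))
assert abs(clipped_norm - 1.0) < 1e-12
```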
6
What is a computational graph? How is it used in backprop?
🔥 Hard
Answer: A directed acyclic graph where nodes are operations/variables and edges define dependencies. Backprop traverses the graph in reverse topological order, multiplying gradients via the chain rule. Frameworks (TensorFlow, PyTorch) build autograd on this.
7
How does automatic differentiation (autograd) relate to backprop?
🔥 Hard
Answer: Backprop is a special case of reverse-mode autodiff. It efficiently computes gradients of scalar loss w.r.t many parameters in one forward+backward pass. Autograd builds the graph dynamically (PyTorch) or statically (TF1).
8
Why does backprop multiply gradients? Why not add?
🔥 Hard
Answer: The chain rule for composed functions is multiplicative: serial dependencies compound, so if one layer's local gradient is zero, the whole path's gradient dies. Gradients are added only across parallel paths into the same variable (e.g., skip connections).
9
Why can't we initialize all weights to zero? Role of backprop?
📊 Medium
Answer: Zero initialization makes all neurons in a layer symmetric: backprop computes identical gradients, so they receive identical updates and never learn diverse features. Random initialization breaks the symmetry.
10
What is gradient checking? How to implement?
🔥 Hard
Answer: Numerically approximate the gradient as (L(θ+ε) − L(θ−ε)) / (2ε) and compare it with the analytical backprop gradient. Too expensive to run during training; use it only for debugging.
eps = 1e-7; numeric_grad = (loss(w+eps) - loss(w-eps)) / (2*eps)
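A runnable version of that one-liner (plain Python; the quadratic loss and values are toy assumptions), comparing via relative error as is standard for gradient checks:

```python
def numerical_grad(f, w, eps=1e-7):
    # central difference: (f(w+eps) - f(w-eps)) / (2*eps)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def relative_error(a, b):
    return abs(a - b) / max(abs(a), abs(b), 1e-12)

# toy loss L(w) = (w*x - y)^2 with analytic gradient 2*(w*x - y)*x
x, y, w = 2.0, 1.0, 0.3
loss = lambda w_: (w_ * x - y) ** 2
analytic = 2 * (w * x - y) * x
numeric = numerical_grad(loss, w)
assert relative_error(analytic, numeric) < 1e-7
```

A relative error below ~1e-7 with double precision suggests the analytic gradient is correct; ~1e-2 or worse usually means a backprop bug.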
11
How does backprop work through max pooling?
🔥 Hard
Answer: Gradient only passes to the neuron that achieved the max (argmax). Others get zero gradient. It's like a switch: route error to winner.
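The switch behavior as a minimal sketch (plain Python, 1-D pool over a small toy window):

```python
def maxpool_forward(x):
    # returns the max and its index (the index is cached for backward)
    idx = max(range(len(x)), key=lambda i: x[i])
    return x[idx], idx

def maxpool_backward(grad_out, idx, size):
    # route the incoming gradient to the argmax; all other inputs get zero
    grad_in = [0.0] * size
    grad_in[idx] = grad_out
    return grad_in

x = [0.2, 1.7, -0.5, 0.9]
out, idx = maxpool_forward(x)
grad = maxpool_backward(1.0, idx, len(x))
assert out == 1.7 and grad == [0.0, 1.0, 0.0, 0.0]
```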
12
Derive gradient of softmax + cross-entropy loss.
🔥 Hard
Answer: Combined gradient simplifies to (p - y) where p is softmax output, y is one-hot target. Very elegant and numerically stable.
∂L/∂z_i = p_i - y_i
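A sketch verifying p − y numerically (plain Python, toy logits; uses the max-shift trick for stable softmax):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(z, target):
    return -math.log(softmax(z)[target])

z, target = [2.0, 1.0, 0.1], 0
p = softmax(z)
# analytic gradient of the combined loss: dL/dz_i = p_i - y_i
analytic = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

# check each component against a central difference
eps = 1e-6
for i in range(len(z)):
    zp, zm = z[:], z[:]
    zp[i] += eps
    zm[i] -= eps
    numeric = (cross_entropy(zp, target) - cross_entropy(zm, target)) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-6
```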
13
Why are in-place operations (e.g., .relu_()) problematic for backprop?
📊 Medium
Answer: Backprop needs the intermediate activations saved during the forward pass (e.g., the input to ReLU) to compute gradients. In-place ops overwrite those buffers, corrupting the graph; PyTorch tracks tensor versions and raises an error when a needed value was modified in place.
14
Can backprop compute second-order gradients? How?
🔥 Hard
Answer: Yes, via automatic differentiation on the gradient graph (e.g., PyTorch `torch.autograd.grad`). Used in meta-learning, Hessian-free optimization, etc.
15
Is backprop the same as reverse-mode autodiff?
📊 Medium
Answer: Backprop is the algorithm applied to neural nets; reverse-mode autodiff is the general technique. Backprop = reverse-mode AD applied to a scalar loss with caching.
16
Role of Jacobian matrix in backpropagation?
🔥 Hard
Answer: For vector functions, the local gradient is a Jacobian matrix (∂output/∂input). Backprop multiplies Jacobians along the path. In practice, frameworks use vector-Jacobian products (VJPs) for efficiency.
v^T · J (VJP) instead of full J
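A small plain-Python sketch of a VJP for a linear map z = Wx (where the Jacobian ∂z/∂x is W itself); the point is that v^T W is computed in one pass without materializing J, which matters when J would be huge:

```python
def matvec(W, x):
    # forward: z = W x
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def vjp_linear(W, v):
    # backward: v^T W, computed without storing an n_out x n_in Jacobian
    n_in = len(W[0])
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(n_in)]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3x2 map; J of z = Wx is W
v = [1.0, 0.5, -1.0]                      # upstream gradient dL/dz
grad_x = vjp_linear(W, v)                 # dL/dx = v^T W
assert grad_x == [-2.5, -2.0]
```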
17
Define the error signal (δ) in backprop.
📊 Medium
Answer: δ_i^l = ∂L/∂z_i^l, the gradient w.r.t. the pre-activation at layer l. It measures how much the total loss changes when that pre-activation changes. Propagated backward: δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).
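The recursion on a tiny 1-1-1 network (plain Python, scalar weights so the transpose is trivial; toy values), checked against a central difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny net: z1 = w1*x, a1 = sigma(z1), z2 = w2*a1 (linear output), L = (z2 - y)^2
x, y, w1, w2 = 1.5, 0.7, 0.4, -0.8

z1 = w1 * x
a1 = sigmoid(z1)
z2 = w2 * a1
delta2 = 2 * (z2 - y)                   # delta^2 = dL/dz2 (linear output layer)
delta1 = w2 * delta2 * (a1 * (1 - a1))  # delta^1 = (w2)^T delta^2 * sigma'(z1)
dL_dw1 = delta1 * x                     # weight gradient: delta^1 times its input

# numerical check of the recursion
eps = 1e-6
loss = lambda w1_: (w2 * sigmoid(w1_ * x) - y) ** 2
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(dL_dw1 - numeric) < 1e-6
```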
18
What is Backpropagation Through Time (BPTT)?
📊 Medium
Answer: BPTT unfolds RNN through time steps, treats it as a deep network with shared weights. Gradients sum over time. Suffers from vanishing/exploding due to repeated multiplications. Truncated BPTT limits steps.
19
Why did greedy layerwise pretraining help backprop in early deep learning?
🔥 Hard
Answer: Initialized weights in a sensible region, avoiding vanishing gradients. Backprop then fine-tuned. Modern techniques (ReLU, batch norm, good init) made pretraining less critical.
20
How do skip connections (ResNet) help backpropagation?
📊 Medium
Answer: Skip connections create an alternative gradient highway through the identity mapping. Gradient flows directly along the skip path, mitigating vanishing gradients and enabling very deep networks (>100 layers).
Gradient shortcut (y = x + F(x)): ∂L/∂x = ∂L/∂y · (I + ∂F/∂x)
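A minimal numeric sketch (plain Python; residual branch F(x) = σ(w·x) with a large toy weight that saturates the sigmoid, so F'(x) ≈ 0):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w = 0.5, 10.0             # large w saturates the sigmoid branch
s = sigmoid(w * x)
F_prime = w * s * (1 - s)    # derivative of the residual branch, nearly 0
dL_dy = 1.0                  # upstream gradient

dL_dx_residual = dL_dy * (1 + F_prime)  # with skip: identity term keeps it alive
dL_dx_plain = dL_dy * F_prime           # without skip: nearly vanishes
assert dL_dx_plain < 0.1 < dL_dx_residual
```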
Backpropagation – Interview Cheat Sheet
Core concepts
- Chain rule: multiplicative gradients
- Vanishing: sigmoid/tanh, deep nets
- Exploding: large weights; fix with clipping
- Autograd: reverse-mode AD
Debugging
- Gradient checking, monitor gradient norms
Modern improvements
- ResNet: skip connections
- Batch Norm: smoother gradients
- ReLU: non-saturating
- Optimizers: Adam, SGD with momentum
Framework
- PyTorch: `loss.backward()`; TF: `tf.GradientTape`
⚡ Key rule: "Backprop = chain rule + caching + gradient descent."