Vanishing & Exploding Gradients — 15 Interview Questions
Products of Jacobians through depth and time, sigmoid saturation, ResNets, LSTM gates, and gradient clipping.
1. What is the vanishing gradient problem? (Easy)
Answer: In backprop, each layer's gradient is a product of Jacobian factors accumulated through the layers above it. If many factors are < 1 (e.g. saturated sigmoid derivatives), early-layer gradients shrink toward 0 and those weights barely update.
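A minimal PyTorch sketch (toy depth, width, and data are illustrative assumptions) showing the effect: gradient norms in a deep sigmoid MLP shrink sharply toward the early layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 30, 64
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers)

loss = model(torch.randn(8, width)).sum()
loss.backward()

# Early layers sit behind a long product of small Jacobian factors,
# so their gradient norms are typically orders of magnitude smaller.
for i in (0, depth // 2, depth - 1):
    w = model[2 * i].weight  # Linear modules sit at even indices
    print(f"layer {i:2d} grad norm: {w.grad.norm().item():.3e}")
```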
2. What are exploding gradients? (Easy)
Answer: The same product picture in reverse: factors repeatedly > 1 produce huge gradients, unstable updates, and NaNs. Common in RNNs unrolled over long sequences.
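A toy sketch (assuming a linear recurrence h_t = W h_{t-1}) of why unrolling explodes: backprop through time multiplies by the same Jacobian W at every step, and with spectral radius above 1 the product grows geometrically.

```python
import torch

torch.manual_seed(0)
W = torch.randn(32, 32) * 0.5   # spectral radius well above 1 at this scale
g = torch.ones(32)              # gradient arriving at the final time step
for t in range(41):
    if t % 10 == 0:
        print(f"step {t:2d}: ||g|| = {g.norm().item():.3e}")
    g = W.T @ g                 # one BPTT step: multiply by the Jacobian
```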
3. Why do sigmoid/tanh worsen vanishing? (Medium)
Answer: The sigmoid's derivative is at most 0.25, and both sigmoid and tanh saturate to near-zero derivatives away from the origin, so each stacked layer shrinks the backward signal.
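The 0.25 bound follows from σ′(x) = σ(x)(1 − σ(x)), which is maximized where σ(x) = 1/2. A quick numerical check:

```python
import torch

x = torch.linspace(-10, 10, 10001)
d = torch.sigmoid(x) * (1 - torch.sigmoid(x))  # sigmoid derivative
print(d.max().item())  # ~0.25, attained at x = 0; it decays fast in saturation
```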
4. How does ReLU help (and one caveat)? (Medium)
Answer: Its derivative is exactly 1 for active neurons, so there is far less shrinkage than with sigmoid. Caveat: dead ReLUs pass zero gradient and may never recover.
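A minimal sketch of the caveat (the extreme bias is an artificial assumption to force dead units): a ReLU whose pre-activation is negative for every input passes zero gradient to everything upstream.

```python
import torch
import torch.nn as nn

lin = nn.Linear(8, 4)
with torch.no_grad():
    lin.bias.fill_(-100.0)  # force all pre-activations negative ("dead" units)

out = torch.relu(lin(torch.randn(16, 8)))
out.sum().backward()
print(lin.weight.grad.abs().max().item())  # 0.0: no learning signal flows back
```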
5. How do residual connections improve gradient flow? (Medium)
Answer: y = F(x) + x adds an identity path; the block's Jacobian is I + ∂F/∂x, so gradient can bypass the stacked layers through the identity term. This eases training of very deep nets.
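A minimal residual-block sketch: because y = F(x) + x, an identity path carries gradient past F even when F's own gradient is tiny.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.F(x) + x  # identity shortcut

x = torch.randn(4, 16, requires_grad=True)
ResidualBlock(16)(x).sum().backward()
# The shortcut alone contributes 1 per element to dL/dx, so the gradient
# stays near 1 regardless of how small F's contribution is.
print(x.grad.mean().item())
```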
6. LSTM vs vanilla RNN for vanishing gradients. (Medium)
Answer: The LSTM's cell state and gated, additive updates allow better long-range gradient flow than a vanilla tanh recurrence, where every step multiplies in another Jacobian.
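A rough sketch with untrained toy models: gradient of the last output w.r.t. the first input for a tanh RNN vs an LSTM. With default init both decay, so we raise the LSTM's forget-gate bias (a standard trick) to keep the cell-state path open; PyTorch orders the gates (input, forget, cell, output) in the bias vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, H = 50, 32
lstm = nn.LSTM(H, H)
with torch.no_grad():
    lstm.bias_ih_l0[H:2 * H].fill_(4.0)  # forget gate ~ sigmoid(4) ~ 0.98

x = torch.randn(T, 1, H, requires_grad=True)
for name, net in [("tanh RNN", nn.RNN(H, H)), ("LSTM", lstm)]:
    out, _ = net(x)
    out[-1].sum().backward()
    print(f"{name}: ||dL/dx_0|| = {x.grad[0].norm().item():.3e}")
    x.grad = None  # reset between models
```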
7. GRU vs LSTM, from the gradient angle. (Easy)
Answer: The GRU has fewer gates but uses the same idea: gated, blended updates mitigate vanishing over long sequences, often with comparable performance at lower compute.
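One concrete way to see the compute difference (sizes are illustrative): the GRU's three gate blocks vs the LSTM's four give it roughly 3/4 the recurrent parameters at the same hidden size.

```python
import torch.nn as nn

for name, net in [("LSTM", nn.LSTM(128, 128)), ("GRU", nn.GRU(128, 128))]:
    n = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n} parameters")  # GRU comes out at ~3/4 of the LSTM
```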
8. Gradient clipping: norm vs value. (Medium)
Answer: Norm clipping rescales the whole gradient when ||g|| exceeds a threshold (standard for RNNs); value clipping caps each element independently (less common). Either stops one bad batch from destroying the weights.
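A minimal training-step sketch using PyTorch's two clipping utilities (model, data, and thresholds are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.LSTM(32, 32)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

out, _ = model(torch.randn(100, 4, 32))
loss = out.pow(2).mean()

opt.zero_grad()
loss.backward()
# Norm clipping: rescale the whole gradient if its global norm exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Value clipping (the less common alternative) would cap each element instead:
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
opt.step()
```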
9. Does batch norm fix vanishing gradients? (Medium)
Answer: It stabilizes activation distributions and can help optimization indirectly, but it is not a guarantee; deep nets still benefit from good init, ReLU, and residuals.
10. Highway networks vs ResNets (brief). (Hard)
Answer: Highway networks learn a gate that blends the skip and transform paths; ResNets use an ungated identity skip plus a simpler F(x). ResNets won out in vision for their simplicity and performance.
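A minimal highway-layer sketch following the paper's form y = T(x)·H(x) + (1 − T(x))·x, with the transform gate biased negative at init so the carry path dominates early:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)  # transform path
        self.T = nn.Linear(dim, dim)  # learned gate
        nn.init.constant_(self.T.bias, -2.0)  # favor the carry path at init

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * torch.relu(self.H(x)) + (1 - t) * x

y = HighwayLayer(16)(torch.randn(4, 16))
```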
11. Do Transformers vanish like RNNs? (Medium)
Answer: Depth is finite and every path includes residual connections plus LayerNorm; attention mixes all tokens in O(1) depth per layer, so there is no T-step Jacobian product as in an unrolled RNN. Very deep stacks still need design care.
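A pre-LN sublayer sketch (one common arrangement, not the only one) showing the identity path around attention:

```python
import torch
import torch.nn as nn

class PreLNAttentionBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads)

    def forward(self, x):  # x: (seq, batch, dim)
        h = self.ln(x)
        out, _ = self.attn(h, h, h)  # self-attention mixes all tokens at once
        return x + out               # residual path skips both LN and attention

y = PreLNAttentionBlock(32)(torch.randn(10, 2, 32))
```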
12. Link to weight initialization. (Easy)
Answer: Good init keeps activations in a range where derivatives are not tiny everywhere, which reduces extreme Jacobian products at the start of training.
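A standard init sketch: Kaiming init for ReLU layers (Xavier would be the analogue for tanh/sigmoid) keeps activation variance roughly constant across layers at the start of training.

```python
import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)  # applies init_weights to every submodule
```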
13. How might you detect exploding gradients in logs? (Easy)
Answer: A suddenly NaN loss, spikes in the gradient norm, or weights blowing up; watch the global gradient norm at every step.
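A monitoring sketch (the threshold is problem-specific, chosen here for illustration): compute the global gradient norm each step and flag spikes or NaNs before they corrupt the weights.

```python
import math
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm().item() ** 2
    return math.sqrt(total)

# In the training loop, after loss.backward():
#   norm = global_grad_norm(model)
#   if math.isnan(norm) or norm > 100.0:
#       print(f"warning: gradient norm {norm:.2e}")
```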
14. Mixed precision and loss scaling. (Hard)
Answer: Small gradients can underflow in fp16; loss scaling multiplies the loss before the backward pass so gradients stay in the representable range, then unscales them before the optimizer step.
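A standard PyTorch AMP sketch (assumes a CUDA device; model and loss are illustrative): GradScaler multiplies the loss before backward so fp16 gradients don't underflow, then unscales before the optimizer step.

```python
import torch

model = torch.nn.Linear(64, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 64, device="cuda")
with torch.cuda.amp.autocast():            # forward in mixed precision
    loss = model(x).pow(2).mean()

opt.zero_grad()
scaler.scale(loss).backward()              # backward on the scaled loss
scaler.unscale_(opt)                       # unscale first if you clip gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(opt)                           # skips the step on inf/NaN gradients
scaler.update()                            # adapts the scale factor
```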
15. One-sentence fix menu for interviews. (Easy)
Answer: Use ReLU, good init, BN/LN, residuals, gated RNNs or attention, and gradient clipping when training recurrent or unstable nets.
Always mention products of Jacobians along the computation graph: that is the core math story.
Quick review checklist
- Vanish vs explode; sigmoid vs ReLU; ResNet shortcut.
- LSTM/GRU; gradient clipping; BN’s indirect role.
- Transformers vs RNN depth; mention loss scaling for mixed precision.