RNN & LSTM: 20 Interview Questions
Master vanilla RNNs, vanishing gradient, LSTM gates, GRU, bidirectional RNNs, seq2seq, attention, and BPTT. Concise, interview-ready answers with formulas.
Topics: RNN · LSTM · GRU · Vanishing Gradient · Bidirectional · Seq2Seq
1. What is a Recurrent Neural Network (RNN)? Typical applications? (⚡ Easy)
Answer: RNNs process sequential data by maintaining a hidden state that captures information from previous time steps. They share weights across time. Applications: NLP (language modeling, translation), time series forecasting, speech recognition, video analysis.
h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h) ; y_t = W_hy·h_t + b_y
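The recurrence above can be sketched in a few lines of NumPy; the sizes (4-dim input, 3-dim hidden, 2-dim output) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(3, 3))   # hidden-to-hidden weights
W_xh = rng.normal(scale=0.1, size=(3, 4))   # input-to-hidden weights
W_hy = rng.normal(scale=0.1, size=(2, 3))   # hidden-to-output weights
b_h, b_y = np.zeros(3), np.zeros(2)

def rnn_step(h_prev, x_t):
    """One recurrence: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

h = np.zeros(3)                       # initial hidden state
for x_t in rng.normal(size=(5, 4)):   # a length-5 toy sequence
    h, y = rnn_step(h, x_t)           # the same weights are reused at every step
```

Note that the loop reuses the same weight matrices at every step — this is the weight sharing across time discussed in Q12.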
2. Why do vanilla RNNs suffer from vanishing/exploding gradients? (📊 Medium)
Answer: During BPTT, the gradient is multiplied by the same recurrent weight matrix W_hh (times the tanh derivative, which is ≤ 1) at every time step. If the largest eigenvalue magnitude of W_hh is < 1, gradients shrink exponentially over the sequence (vanish); if > 1, they grow exponentially (explode). Either way, the model cannot learn long-range dependencies.
Key point: vanilla RNNs fail on long sequences; LSTM/GRU mitigate this via gating.
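The exponential effect is easy to demonstrate numerically. A toy sketch (a diagonal recurrent matrix so its eigenvalues equal a chosen scale, ignoring the tanh derivative):

```python
import numpy as np

def gradient_norm_after(T, scale):
    """Norm of a gradient after T backward steps through W_hh = scale * I."""
    W_hh = scale * np.eye(3)       # recurrent matrix with all eigenvalues = scale
    g = np.ones(3)
    for _ in range(T):
        g = W_hh.T @ g             # repeated multiplication during BPTT
    return np.linalg.norm(g)

vanish = gradient_norm_after(50, 0.9)    # eigenvalues < 1 → gradient vanishes
explode = gradient_norm_after(50, 1.1)   # eigenvalues > 1 → gradient explodes
```

After only 50 steps the two norms differ by several orders of magnitude, which is why vanilla RNNs effectively see only a short history.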
3. Explain the LSTM architecture. How does it solve vanishing gradient? (🔥 Hard)
Answer: LSTM introduces cell state (C_t) as a memory highway with additive updates. Three gates: forget (f), input (i), output (o). Gradient flows through cell state with constant error carousel (CEC) – addition rather than multiplication, preserving gradient.
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o·[h_{t-1}, x_t] + b_o); h_t = o_t * tanh(C_t)
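The gate equations above map directly to code. A minimal NumPy sketch, with the four gate pre-activations stacked into one weight matrix (sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W maps concat [h_{t-1}, x_t] to 4 stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[0:H])              # forget gate
    i = sigmoid(z[H:2*H])            # input gate
    o = sigmoid(z[2*H:3*H])          # output gate
    C_tilde = np.tanh(z[3*H:4*H])    # candidate cell state
    C_t = f * C_prev + i * C_tilde   # additive cell-state update (the "highway")
    h_t = o * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
H, X = 3, 4                          # hypothetical hidden/input sizes
W = rng.normal(scale=0.1, size=(4 * H, H + X))
b = np.zeros(4 * H)
h, C = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, b)
```

The key line is the cell-state update: C_t is built by addition, so the backward pass through it is gated by f_t rather than repeatedly multiplied by a weight matrix.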
4. What is the purpose of each LSTM gate? (📊 Medium)
Answer:
- Forget gate: decides what to discard from cell state.
- Input gate: decides what new info to store.
- Output gate: decides what to output based on cell state.
5. How is GRU different from LSTM? When to prefer GRU? (📊 Medium)
Answer: GRU has 2 gates (update, reset), no separate cell state; merges forget/input gates. Simpler, fewer parameters, less prone to overfitting on small data. LSTM is more expressive; GRU often matches performance with less compute.
LSTM: 3 gates + cell state; GRU: 2 gates, hidden state only.
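For contrast with the LSTM, here is a GRU cell sketch in NumPy — two gates, no separate cell state, and the new hidden state is an interpolation controlled by the update gate (weight shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: z = update gate, r = reset gate."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                                    # update gate
    r = sigmoid(Wr @ hx)                                    # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde                   # interpolate old/new

rng = np.random.default_rng(0)
H, X = 3, 4                              # hypothetical hidden/input sizes
Wz = rng.normal(scale=0.1, size=(H, H + X))
Wr = rng.normal(scale=0.1, size=(H, H + X))
Wh = rng.normal(scale=0.1, size=(H, H + X))
h = gru_step(rng.normal(size=X), np.zeros(H), Wz, Wr, Wh)
```

Counting parameters makes the "simpler" claim concrete: three weight matrices here versus four (plus the cell state) for the LSTM.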
6. What is a bidirectional RNN? When to use it? (📊 Medium)
Answer: BRNN processes sequence forward and backward, concatenating hidden states. Captures context from both past and future. Used in NLP tasks (NER, POS tagging) where entire sequence is available. Not for real-time or streaming.
7. Explain Backpropagation Through Time (BPTT). What is truncated BPTT? (🔥 Hard)
Answer: BPTT unfolds the RNN across all time steps and computes gradients over the entire sequence, which is expensive in both time and memory. Truncated BPTT limits the unfolding to k steps: the hidden state is carried forward between chunks but gradients are cut (detached) at chunk boundaries. This yields approximate gradients at constant cost per update, making training on long sequences practical.
8. Describe the seq2seq model. What are the roles of encoder and decoder? (📊 Medium)
Answer: Seq2seq uses encoder RNN to compress input sequence into context vector (final hidden state). Decoder RNN generates output sequence from context. Used in machine translation, summarization.
9. Why was attention introduced? How does it help RNNs? (🔥 Hard)
Answer: Seq2seq with single context vector fails for long sentences (bottleneck). Attention allows decoder to look at all encoder hidden states, weighted by relevance. Provides shortcut to gradient flow and interpretability.
e_{t,i} = score(h_t^dec, h_i^enc); α_{t,i} = softmax_i(e_{t,i}) (normalized over encoder positions i); c_t = Σ_i α_{t,i} h_i^enc
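These three equations fit in a few lines of NumPy. A sketch using a dot-product score (one common choice among several; sizes are illustrative):

```python
import numpy as np

def attention_context(h_dec, H_enc):
    """Attention over encoder states: returns context vector c_t and weights α."""
    e = H_enc @ h_dec                    # e_{t,i} = score(h_t^dec, h_i^enc), dot product
    alpha = np.exp(e - e.max())          # softmax over encoder positions i
    alpha /= alpha.sum()                 # α_{t,i}, sums to 1
    c_t = alpha @ H_enc                  # c_t = Σ_i α_{t,i} h_i^enc
    return c_t, alpha

rng = np.random.default_rng(0)
H_enc = rng.normal(size=(6, 3))          # 6 encoder hidden states of dim 3
c, alpha = attention_context(rng.normal(size=3), H_enc)
```

Because c_t is a direct weighted sum of all encoder states, gradients reach every input position in one step — the "shortcut" mentioned above.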
10. What is a peephole LSTM? (🔥 Hard)
Answer: Peephole connections allow gates to see the cell state (C_{t-1}) in addition to h_{t-1} and x_t. Provides finer temporal control, but not always beneficial.
11. What are stacked RNNs? Benefits? (📊 Medium)
Answer: Multiple RNN layers where hidden state of one layer is input to next. Increases capacity, learns hierarchical representations. Higher layers capture longer-term abstractions.
12. Why do RNNs share weights across time? (📊 Medium)
Answer: Weight sharing enables generalization across sequence lengths and positions. Model learns transition function independent of time step. Reduces parameters dramatically.
13. How do you handle exploding gradients in RNNs? (📊 Medium)
Answer: Gradient clipping: rescale gradient if norm exceeds threshold. Also weight regularization, careful initialization (e.g., identity matrix for recurrent weights).
if grad_norm > threshold: grad = grad * (threshold / grad_norm)
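The clipping rule above as a small runnable function (norm-based clipping; the threshold value is a hyperparameter):

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]), 5.0)   # norm 50 → rescaled to norm 5
```

Note this preserves the gradient's direction and only shrinks its magnitude, which is why it is safe to apply on every update.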
14. How do LSTM/GRU mitigate vanishing gradients specifically? (🔥 Hard)
Answer: LSTM's cell state has additive (not multiplicative) gradient flow. Forget gate can be close to 1, preserving gradient. GRU similarly uses additive update via update gate. Both create gradient highways.
15. How do RNNs handle variable-length sequences? (⚡ Easy)
Answer: RNNs process tokens one by one; hidden state adapts. For batching, we pad sequences to same length and use masking to ignore padding.
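Padding and masking in a batch can be sketched with a boolean mask (a toy batch of two sequences, lengths 3 and 5, padded to 5):

```python
import numpy as np

lengths = np.array([3, 5])               # true lengths of the two sequences
T = 5                                    # padded length
# mask[b, t] is True where token t is real, False where it is padding
mask = np.arange(T)[None, :] < lengths[:, None]

# e.g. a masked mean of per-step losses that ignores padded positions
losses = np.ones((2, T))                 # placeholder per-token losses
mean_loss = (losses * mask).sum(axis=1) / mask.sum(axis=1)
```

Without the mask, padded positions would dilute the loss of short sequences; the same mask is also used to zero out hidden-state updates at padded steps.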
16. What is teacher forcing in RNN training? (📊 Medium)
Answer: During decoder training, feed the ground-truth previous token as input instead of the model's own prediction. This speeds convergence but creates exposure bias: at test time the model must condition on its own (possibly wrong) outputs, a situation it never saw in training. Scheduled sampling gradually mixes in the model's own predictions to close this gap.
17. Compare RNNs and Transformers for sequence modeling. (🔥 Hard)
Answer: RNNs process tokens sequentially (O(n) sequential steps), so they cannot be parallelized across time; Transformers process all positions in parallel, but self-attention costs O(n²) in sequence length and requires positional encodings since it has no inherent notion of order. Transformers capture long-range dependencies better; RNNs are lighter for short sequences and have lower memory at inference (a fixed-size hidden state).
18. When would you still choose RNN/LSTM today? (📊 Medium)
Answer: Low-latency streaming tasks (speech recognition), small datasets, mobile/edge devices (lightweight), or when interpretability of hidden states is useful.
19. What is beam search in RNN decoders? (🔥 Hard)
Answer: Instead of greedy decoding, beam search keeps k most probable sequences at each step. Finds higher-likelihood outputs at cost of computation.
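A toy beam search sketch over a tiny 3-token vocabulary. For simplicity the per-step distributions here are fixed in advance; a real decoder would recompute them conditioned on each beam's tokens:

```python
import numpy as np

def beam_search(step_log_probs, k=2):
    """Keep the k highest-scoring prefixes at each step.

    step_log_probs: array of shape (T, V) with log-probabilities per step
    (toy setup: independent of the prefix).
    """
    beams = [((), 0.0)]                        # (token sequence, total log-prob)
    for step in step_log_probs:
        candidates = [(seq + (v,), score + step[v])
                      for seq, score in beams
                      for v in range(len(step))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]                 # prune to the k best prefixes
    return beams

log_p = np.log(np.array([[0.6, 0.3, 0.1],
                         [0.1, 0.8, 0.1]]))
best_seq, best_score = beam_search(log_p, k=2)[0]
```

With k = 1 this reduces to greedy decoding; larger k trades computation for higher-likelihood outputs.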
20. Why initialize the LSTM forget gate bias to 1? (🔥 Hard)
Answer: Setting the forget gate bias to 1 (or a larger positive value) gives f_t ≈ σ(1) ≈ 0.73 at initialization, so the cell state is mostly retained and gradients can flow through long spans early in training, before the gates have learned anything. This is standard practice (the default in Keras/TensorFlow via unit_forget_bias; in PyTorch it must be set manually).
RNN & LSTM – Interview Cheat Sheet
Vanilla RNN
- ✗ Vanishing gradient
- ✗ Short memory
- ✓ Simple, fast
LSTM
- Forget: what to discard
- Input: what to add
- Output: what to reveal
GRU
- Update: merge forget/input
- Reset: ignore past
Modern
- Attention: weighted context
- BiRNN: past + future
- Transformer: parallel, SOTA
Verdict: "LSTM/GRU for sequence memory, attention for alignment, Transformers for scale."