RNN & LSTM: 20 Interview Questions
Master vanilla RNNs, vanishing gradient, LSTM gates, GRU, bidirectional RNNs, seq2seq, attention, and BPTT. Concise, interview-ready answers with formulas.
Topics: RNN · LSTM · GRU · Vanishing Gradient · Bidirectional · Seq2Seq
1. What is a Recurrent Neural Network (RNN)? Typical applications? (⚡ Easy)
Answer: RNNs process sequential data by maintaining a hidden state that captures information from previous time steps. They share weights across time. Applications: NLP (language modeling, translation), time series forecasting, speech recognition, video analysis.
h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h) ; y_t = W_hy·h_t + b_y
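The recurrence above can be sketched in a few lines of NumPy; the sizes (4-dim input, 3-dim hidden, 2-dim output) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(3, 3))   # hidden-to-hidden weights
W_xh = rng.normal(scale=0.1, size=(3, 4))   # input-to-hidden weights
W_hy = rng.normal(scale=0.1, size=(2, 3))   # hidden-to-output weights
b_h, b_y = np.zeros(3), np.zeros(2)

def rnn_step(h_prev, x_t):
    """One recurrence: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

h = np.zeros(3)                       # initial hidden state
for x_t in rng.normal(size=(5, 4)):   # a length-5 toy sequence
    h, y = rnn_step(h, x_t)           # the same weights are reused at every step
```

Note that the loop reuses the same weight matrices at every step — this is the weight sharing across time discussed in Q12.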
2. Why do vanilla RNNs suffer from vanishing/exploding gradients? (📊 Medium)
Answer: During BPTT, the gradient is multiplied by the same recurrent weight matrix W_hh (times the tanh derivative, which is ≤ 1) at every time step. If the largest eigenvalue magnitude of W_hh is < 1, gradients shrink exponentially over the sequence (vanish); if > 1, they grow exponentially (explode). Either way, the model cannot learn long-range dependencies.
Key point: vanilla RNNs fail on long sequences; LSTM/GRU mitigate this via gating.
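The exponential effect is easy to demonstrate numerically. A toy sketch (a diagonal recurrent matrix so its eigenvalues equal a chosen scale, ignoring the tanh derivative):

```python
import numpy as np

def gradient_norm_after(T, scale):
    """Norm of a gradient after T backward steps through W_hh = scale * I."""
    W_hh = scale * np.eye(3)       # recurrent matrix with all eigenvalues = scale
    g = np.ones(3)
    for _ in range(T):
        g = W_hh.T @ g             # repeated multiplication during BPTT
    return np.linalg.norm(g)

vanish = gradient_norm_after(50, 0.9)    # eigenvalues < 1 → gradient vanishes
explode = gradient_norm_after(50, 1.1)   # eigenvalues > 1 → gradient explodes
```

After only 50 steps the two norms differ by several orders of magnitude, which is why vanilla RNNs effectively see only a short history.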
3. Explain the LSTM architecture. How does it solve vanishing gradient? (🔥 Hard)
Answer: LSTM introduces cell state (C_t) as a memory highway with additive updates. Three gates: forget (f), input (i), output (o). Gradient flows through cell state with constant error carousel (CEC) – addition rather than multiplication, preserving gradient.
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o·[h_{t-1}, x_t] + b_o); h_t = o_t * tanh(C_t)
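The gate equations above map directly to code. A minimal NumPy sketch, with the four gate pre-activations stacked into one weight matrix (sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W maps concat [h_{t-1}, x_t] to 4 stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[0:H])              # forget gate
    i = sigmoid(z[H:2*H])            # input gate
    o = sigmoid(z[2*H:3*H])          # output gate
    C_tilde = np.tanh(z[3*H:4*H])    # candidate cell state
    C_t = f * C_prev + i * C_tilde   # additive cell-state update (the "highway")
    h_t = o * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
H, X = 3, 4                          # hypothetical hidden/input sizes
W = rng.normal(scale=0.1, size=(4 * H, H + X))
b = np.zeros(4 * H)
h, C = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, b)
```

The key line is the cell-state update: C_t is built by addition, so the backward pass through it is gated by f_t rather than repeatedly multiplied by a weight matrix.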
4. What is the purpose of each LSTM gate? (📊 Medium)
Answer:
- Forget gate: decides what to discard from cell state.
- Input gate: decides what new info to store.
- Output gate: decides what to output based on cell state.
5. How is GRU different from LSTM? When to prefer GRU? (📊 Medium)
Answer: GRU has 2 gates (update, reset), no separate cell state; merges forget/input gates. Simpler, fewer parameters, less prone to overfitting on small data. LSTM is more expressive; GRU often matches performance with less compute.
LSTM: 3 gates + cell state; GRU: 2 gates, hidden state only.
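For contrast with the LSTM, here is a GRU cell sketch in NumPy — two gates, no separate cell state, and the new hidden state is an interpolation controlled by the update gate (weight shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: z = update gate, r = reset gate."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                                    # update gate
    r = sigmoid(Wr @ hx)                                    # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde                   # interpolate old/new

rng = np.random.default_rng(0)
H, X = 3, 4                              # hypothetical hidden/input sizes
Wz = rng.normal(scale=0.1, size=(H, H + X))
Wr = rng.normal(scale=0.1, size=(H, H + X))
Wh = rng.normal(scale=0.1, size=(H, H + X))
h = gru_step(rng.normal(size=X), np.zeros(H), Wz, Wr, Wh)
```

Counting parameters makes the "simpler" claim concrete: three weight matrices here versus four (plus the cell state) for the LSTM.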
6. What is a bidirectional RNN? When to use it? (📊 Medium)
Answer: BRNN processes sequence forward and backward, concatenating hidden states. Captures context from both past and future. Used in NLP tasks (NER, POS tagging) where entire sequence is available. Not for real-time or streaming.
7. Explain Backpropagation Through Time (BPTT). What is truncated BPTT? (🔥 Hard)
Answer: BPTT unfolds the RNN across all time steps and computes gradients over the entire sequence, which is expensive in both time and memory. Truncated BPTT limits the unfolding to k steps: the hidden state is carried forward between chunks but gradients are cut (detached) at chunk boundaries. This yields approximate gradients at constant cost per update, making training on long sequences practical.
8. Describe the seq2seq model. What are the roles of encoder and decoder? (📊 Medium)
Answer: Seq2seq uses encoder RNN to compress input sequence into context vector (final hidden state). Decoder RNN generates output sequence from context. Used in machine translation, summarization.
9. Why was attention introduced? How does it help RNNs? (🔥 Hard)
Answer: Seq2seq with single context vector fails for long sentences (bottleneck). Attention allows decoder to look at all encoder hidden states, weighted by relevance. Provides shortcut to gradient flow and interpretability.
e_{t,i} = score(h_t^dec, h_i^enc); α_{t,i} = softmax_i(e_{t,i}) (normalized over encoder positions i); c_t = Σ_i α_{t,i} h_i^enc
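These three equations fit in a few lines of NumPy. A sketch using a dot-product score (one common choice among several; sizes are illustrative):

```python
import numpy as np

def attention_context(h_dec, H_enc):
    """Attention over encoder states: returns context vector c_t and weights α."""
    e = H_enc @ h_dec                    # e_{t,i} = score(h_t^dec, h_i^enc), dot product
    alpha = np.exp(e - e.max())          # softmax over encoder positions i
    alpha /= alpha.sum()                 # α_{t,i}, sums to 1
    c_t = alpha @ H_enc                  # c_t = Σ_i α_{t,i} h_i^enc
    return c_t, alpha

rng = np.random.default_rng(0)
H_enc = rng.normal(size=(6, 3))          # 6 encoder hidden states of dim 3
c, alpha = attention_context(rng.normal(size=3), H_enc)
```

Because c_t is a direct weighted sum of all encoder states, gradients reach every input position in one step — the "shortcut" mentioned above.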
10. What is a peephole LSTM? (🔥 Hard)
Answer: Peephole connections allow gates to see the cell state (C_{t-1}) in addition to h_{t-1} and x_t. Provides finer temporal control, but not always beneficial.
11. What are stacked RNNs? Benefits? (📊 Medium)
Answer: Multiple RNN layers where hidden state of one layer is input to next. Increases capacity, learns hierarchical representations. Higher layers capture longer-term abstractions.
12. Why do RNNs share weights across time? (📊 Medium)
Answer: Weight sharing enables generalization across sequence lengths and positions. Model learns transition function independent of time step. Reduces parameters dramatically.
13. How do you handle exploding gradients in RNNs? (📊 Medium)
Answer: Gradient clipping: rescale gradient if norm exceeds threshold. Also weight regularization, careful initialization (e.g., identity matrix for recurrent weights).
if grad_norm > threshold: grad = grad * (threshold / grad_norm)
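The clipping rule above as a small runnable function (norm-based clipping; the threshold value is a hyperparameter):

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]), 5.0)   # norm 50 → rescaled to norm 5
```

Note this preserves the gradient's direction and only shrinks its magnitude, which is why it is safe to apply on every update.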
14. How do LSTM/GRU mitigate vanishing gradients specifically? (🔥 Hard)
Answer: LSTM's cell state has additive (not multiplicative) gradient flow. Forget gate can be close to 1, preserving gradient. GRU similarly uses additive update via update gate. Both create gradient highways.
15. How do RNNs handle variable-length sequences? (⚡ Easy)
Answer: RNNs process tokens one by one; hidden state adapts. For batching, we pad sequences to same length and use masking to ignore padding.
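Padding and masking in a batch can be sketched with a boolean mask (a toy batch of two sequences, lengths 3 and 5, padded to 5):

```python
import numpy as np

lengths = np.array([3, 5])               # true lengths of the two sequences
T = 5                                    # padded length
# mask[b, t] is True where token t is real, False where it is padding
mask = np.arange(T)[None, :] < lengths[:, None]

# e.g. a masked mean of per-step losses that ignores padded positions
losses = np.ones((2, T))                 # placeholder per-token losses
mean_loss = (losses * mask).sum(axis=1) / mask.sum(axis=1)
```

Without the mask, padded positions would dilute the loss of short sequences; the same mask is also used to zero out hidden-state updates at padded steps.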
16. What is teacher forcing in RNN training? (📊 Medium)
Answer: During decoder training, feed the ground-truth previous token as input instead of the model's own prediction. This speeds convergence but creates exposure bias: at test time the model must condition on its own (possibly wrong) outputs, a situation it never saw in training. Scheduled sampling gradually mixes in the model's own predictions to close this gap.
17. Compare RNNs and Transformers for sequence modeling. (🔥 Hard)
Answer: RNNs process tokens sequentially (O(n) sequential steps), so they cannot be parallelized across time; Transformers process all positions in parallel, but self-attention costs O(n²) in sequence length and requires positional encodings since it has no inherent notion of order. Transformers capture long-range dependencies better; RNNs are lighter for short sequences and have lower memory at inference (a fixed-size hidden state).
18. When would you still choose RNN/LSTM today? (📊 Medium)
Answer: Low-latency streaming tasks (speech recognition), small datasets, mobile/edge devices (lightweight), or when interpretability of hidden states is useful.
19. What is beam search in RNN decoders? (🔥 Hard)
Answer: Instead of greedy decoding, beam search keeps k most probable sequences at each step. Finds higher-likelihood outputs at cost of computation.
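A toy beam search sketch over a tiny 3-token vocabulary. For simplicity the per-step distributions here are fixed in advance; a real decoder would recompute them conditioned on each beam's tokens:

```python
import numpy as np

def beam_search(step_log_probs, k=2):
    """Keep the k highest-scoring prefixes at each step.

    step_log_probs: array of shape (T, V) with log-probabilities per step
    (toy setup: independent of the prefix).
    """
    beams = [((), 0.0)]                        # (token sequence, total log-prob)
    for step in step_log_probs:
        candidates = [(seq + (v,), score + step[v])
                      for seq, score in beams
                      for v in range(len(step))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]                 # prune to the k best prefixes
    return beams

log_p = np.log(np.array([[0.6, 0.3, 0.1],
                         [0.1, 0.8, 0.1]]))
best_seq, best_score = beam_search(log_p, k=2)[0]
```

With k = 1 this reduces to greedy decoding; larger k trades computation for higher-likelihood outputs.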
20. Why initialize the LSTM forget gate bias to 1? (🔥 Hard)
Answer: Setting the forget gate bias to 1 (or a larger positive value) gives f_t ≈ σ(1) ≈ 0.73 at initialization, so the cell state is mostly retained and gradients can flow through long spans early in training, before the gates have learned anything. This is standard practice (the default in Keras/TensorFlow via unit_forget_bias; in PyTorch it must be set manually).
RNN & LSTM – Interview Cheat Sheet
Vanilla RNN
- ✗ Vanishing gradient
- ✗ Short memory
- ✓ Simple, fast
LSTM
- Forget: what to discard
- Input: what to add
- Output: what to reveal
GRU
- Update: merge forget/input
- Reset: ignore past
Modern
- Attention: weighted context
- BiRNN: past + future
- Transformer: parallel, SOTA
Verdict: "LSTM/GRU for sequence memory, attention for alignment, Transformers for scale."