
Recurrent Neural Networks — 15 Interview Questions

Hidden state over time, many-to-one vs seq2seq, BPTT truncation, and why LSTM/GRU replaced vanilla RNNs for long dependencies.


1 What is a recurrent neural network? (Easy)
Answer: A model with a hidden state h_t updated each step: h_t = f(h_{t−1}, x_t)—same weights applied across time, suited to sequences.
2 Vanilla RNN update (simple form). (Easy)
Answer: Often h_t = tanh(W_h h_{t−1} + W_x x_t + b): an affine map of the previous hidden state and the current input, passed through a tanh nonlinearity.
h_t = tanh(W_h h_{t−1} + W_x x_t + b)
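A minimal PyTorch sketch of this update; the sizes, weight names, and random initialization are illustrative assumptions, not part of the card.

  import torch

  B, D_in, D_h = 4, 8, 16                     # illustrative batch, input, hidden sizes
  W_x = torch.randn(D_h, D_in) * 0.1          # input-to-hidden weights
  W_h = torch.randn(D_h, D_h) * 0.1           # hidden-to-hidden weights
  b = torch.zeros(D_h)

  def rnn_step(h_prev, x_t):
      # h_t = tanh(W_h h_{t-1} + W_x x_t + b), applied to each batch row
      return torch.tanh(h_prev @ W_h.T + x_t @ W_x.T + b)

  h = torch.zeros(B, D_h)
  for x_t in torch.randn(10, B, D_in):        # 10 steps, same weights reused each step
      h = rnn_step(h, x_t)
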
3 What is backpropagation through time (BPTT)? (Medium)
Answer: Unroll the network over T steps into a DAG, run backprop—gradient flows through every time link. Memory and compute grow with T.
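Hedged sketch of full BPTT using nn.RNNCell: unroll over T steps, score the final state, and let autograd push gradients back through every time link (the squared-error loss and sizes are illustrative).

  import torch
  import torch.nn as nn

  T, B, D_in, D_h = 50, 4, 8, 16
  cell = nn.RNNCell(D_in, D_h)                # the same weights are reused at every step
  xs = torch.randn(T, B, D_in)
  target = torch.randn(B, D_h)

  h = torch.zeros(B, D_h)
  for t in range(T):                          # unrolling: the autograd graph grows with T
      h = cell(xs[t], h)
  loss = ((h - target) ** 2).mean()
  loss.backward()                             # gradient flows back through all T links
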
4 Truncated BPTT: why? (Medium)
Answer: Limit backprop depth in time to a window—cheaper and stabilizes training; trades off long-range credit assignment.
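One common way to truncate, sketched under assumptions (window k, SGD, squared-error loss): detach the hidden state at each window boundary so backprop never reaches past the current window.

  import torch
  import torch.nn as nn

  T, k, B, D_in, D_h = 200, 20, 4, 8, 16
  cell = nn.RNNCell(D_in, D_h)
  opt = torch.optim.SGD(cell.parameters(), lr=0.01)
  xs = torch.randn(T, B, D_in)
  targets = torch.randn(T, B, D_h)

  h = torch.zeros(B, D_h)
  for start in range(0, T, k):
      h = h.detach()                          # cut the graph: no credit assignment beyond this window
      opt.zero_grad()
      loss = 0.0
      for t in range(start, min(start + k, T)):
          h = cell(xs[t], h)
          loss = loss + ((h - targets[t]) ** 2).mean()
      loss.backward()                         # backprop through at most k steps
      opt.step()
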
5 Why do vanilla RNNs struggle with long sequences? (Medium)
Answer: Repeated Jacobian products over steps cause vanishing or exploding gradients—hard to learn long-range dependencies.
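A toy numeric illustration, assuming a linear recurrence h_t = W h_{t−1} so the gradient w.r.t. h_0 is just a T-fold product of the same matrix; the 0.9 and 1.1 scales are arbitrary.

  import torch

  T, D = 50, 16
  for scale in (0.9, 1.1):                    # contracting vs expanding recurrent matrix
      W = scale * torch.eye(D)
      g = torch.ones(D)                       # stand-in for dLoss/dh_T
      for _ in range(T):
          g = W.T @ g                         # one Jacobian factor per time step
      print(scale, g.norm().item())           # ~0.9^50 shrinks toward 0; ~1.1^50 blows up
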
6 LSTM gates: names and roles. (Medium)
Answer: Forget (what to erase from cell), input (what to write), output (what to expose from cell). Cell state carries information additively—better gradient paths.
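One LSTM step written out gate by gate; the stacked-weight layout, names, and sizes are assumptions for the sketch, but the gate equations are the standard ones.

  import torch

  def lstm_step(x_t, h_prev, c_prev, W, U, b):
      # W: (4H, D_in), U: (4H, H), b: (4H,), stacked as [input, forget, cell, output]
      H = h_prev.shape[-1]
      z = x_t @ W.T + h_prev @ U.T + b
      i = torch.sigmoid(z[..., 0*H:1*H])      # input gate: what to write to the cell
      f = torch.sigmoid(z[..., 1*H:2*H])      # forget gate: what to erase from the cell
      g = torch.tanh(z[..., 2*H:3*H])         # candidate cell content
      o = torch.sigmoid(z[..., 3*H:4*H])      # output gate: what to expose as h_t
      c_t = f * c_prev + i * g                # additive cell update: the friendlier gradient path
      h_t = o * torch.tanh(c_t)
      return h_t, c_t

  B, D_in, H = 4, 8, 16
  W, U, b = torch.randn(4*H, D_in), torch.randn(4*H, H), torch.zeros(4*H)
  h, c = lstm_step(torch.randn(B, D_in), torch.zeros(B, H), torch.zeros(B, H), W, U, b)
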
7 GRU vs LSTM: interview contrast. (Easy)
Answer: GRU merges the forget and input gates into a single update gate and drops the separate cell state, so it has fewer parameters; quality is often similar with less compute, while LSTM remains the more common historical baseline.
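A quick parameter-count contrast with the built-in modules (sizes illustrative): GRU has three gate blocks to LSTM's four, so roughly three quarters of the recurrent parameters.

  import torch.nn as nn

  D_in, D_h = 128, 256
  lstm = nn.LSTM(D_in, D_h)
  gru = nn.GRU(D_in, D_h)
  count = lambda m: sum(p.numel() for p in m.parameters())
  print(count(lstm), count(gru))              # LSTM count is roughly 4/3 of GRU's
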
8 Bidirectional RNN. (Easy)
Answer: Two RNNs: one forward, one backward; concatenate hidden states—uses future context; good for tagging/NLP, not for causal online prediction.
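Shape sketch with a bidirectional nn.LSTM (sizes illustrative): per-step outputs concatenate the forward and backward states, so the feature size doubles.

  import torch
  import torch.nn as nn

  T, B, D_in, D_h = 12, 4, 8, 16
  birnn = nn.LSTM(D_in, D_h, bidirectional=True)
  out, (h_n, c_n) = birnn(torch.randn(T, B, D_in))
  print(out.shape)                            # (T, B, 2*D_h): both directions concatenated per step
  print(h_n.shape)                            # (2, B, D_h): final state of each direction
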
9 Encoder–decoder (seq2seq) idea. (Medium)
Answer: Encoder RNN compresses input sequence to context vector; decoder RNN generates output sequence—basis of early NMT before attention dominated.
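A minimal encoder-decoder sketch without attention; the vocabulary sizes, GRU choice, greedy loop, and assumed <sos> token id 0 are all illustrative assumptions.

  import torch
  import torch.nn as nn

  V_src, V_tgt, E, H = 1000, 1200, 64, 128
  enc_emb, dec_emb = nn.Embedding(V_src, E), nn.Embedding(V_tgt, E)
  encoder, decoder = nn.GRU(E, H), nn.GRU(E, H)
  out_proj = nn.Linear(H, V_tgt)

  src = torch.randint(0, V_src, (15, 1))      # (T_src, batch=1) source token ids
  _, context = encoder(enc_emb(src))          # context: (1, 1, H) summary of the whole input

  tok = torch.zeros(1, 1, dtype=torch.long)   # assumed <sos> start token (id 0)
  h = context                                 # decoder starts from the encoder's summary
  for _ in range(20):                         # greedy decoding, fixed max length
      out, h = decoder(dec_emb(tok), h)
      tok = out_proj(out).argmax(-1)          # feed the predicted token back in
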
10 Teacher forcing. (Medium)
Answer: During training, the decoder receives the ground-truth previous token as input instead of its own prediction; this speeds and stabilizes convergence, and the resulting train/test mismatch (exposure bias) is addressed with scheduled sampling and related techniques.
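Hedged training-loop sketch of teacher forcing with a GRUCell decoder (sizes and loss are illustrative); the final comment marks where scheduled sampling would swap in the model's own token.

  import torch
  import torch.nn as nn

  V, E, H, B = 1200, 64, 128, 4
  emb, dec, proj = nn.Embedding(V, E), nn.GRUCell(E, H), nn.Linear(H, V)
  loss_fn = nn.CrossEntropyLoss()

  tgt = torch.randint(0, V, (10, B))          # (T, B) gold output tokens
  h = torch.zeros(B, H)                       # e.g. the encoder context
  loss = 0.0
  for t in range(1, tgt.shape[0]):
      inp = tgt[t - 1]                        # teacher forcing: feed the gold previous token
      h = dec(emb(inp), h)
      loss = loss + loss_fn(proj(h), tgt[t])  # predict gold token t
      # scheduled sampling would sometimes use proj(h).argmax(-1) as the next inp instead
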
11 Padding and pack_padded_sequence: why? (Hard)
Answer: Batched variable-length sequences are padded; pack avoids wasted compute on pad tokens and keeps hidden state meaningful in frameworks like PyTorch.
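A PyTorch packing sketch with a padded batch of lengths 5, 3, 2 (other sizes illustrative).

  import torch
  import torch.nn as nn
  from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

  lengths = torch.tensor([5, 3, 2])           # true lengths, sorted descending here
  B, T, D_in, D_h = 3, 5, 8, 16
  x = torch.randn(B, T, D_in)                 # padded batch, batch_first layout
  rnn = nn.LSTM(D_in, D_h, batch_first=True)

  packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=True)
  packed_out, (h_n, c_n) = rnn(packed)        # pad positions are skipped entirely
  out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
  print(out.shape, h_n.shape)                 # (3, 5, 16) and (1, 3, 16); h_n comes from real tokens
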
12 Many-to-one vs many-to-many examples. (Easy)
Answer: Many-to-one: sentiment from a sentence. Many-to-many: POS tagging per token; seq2seq: translation.
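Shape sketch of both patterns on top of one recurrent encoder (sizes and the linear heads are illustrative).

  import torch
  import torch.nn as nn

  T, B, D_in, D_h, C = 20, 4, 8, 16, 5
  rnn = nn.GRU(D_in, D_h)
  out, h_n = rnn(torch.randn(T, B, D_in))

  many_to_one = nn.Linear(D_h, C)(h_n[-1])    # (B, C): e.g. one sentiment label per sequence
  many_to_many = nn.Linear(D_h, C)(out)       # (T, B, C): e.g. one POS tag per token
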
13 When do Transformers replace RNNs? (Medium)
Answer: When you have the data and compute for self-attention: it processes all positions in parallel and connects any two positions in a single hop, while an RNN must step through the sequence, which is slow on GPUs for long sequences.
14 1D CNN for sequences vs RNN. (Medium)
Answer: Stacked 1D convolutions capture local n-grams and grow context with depth; they are fast and parallel. RNNs or attention are usually better for very long or flexible dependencies, depending on the design.
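A small Conv1d sketch (sizes illustrative): two kernel-size-3 layers give each output position a receptive field of 5 input steps, computed in parallel across the whole sequence.

  import torch
  import torch.nn as nn

  B, D_in, T, D_h = 4, 8, 50, 16
  net = nn.Sequential(
      nn.Conv1d(D_in, D_h, kernel_size=3, padding=1), nn.ReLU(),
      nn.Conv1d(D_h, D_h, kernel_size=3, padding=1), nn.ReLU(),   # receptive field now 5
  )
  y = net(torch.randn(B, D_in, T))            # (B, D_h, T): length preserved by padding
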
15 State one advantage of the RNN family today. (Easy)
Answer: Small memory per step for streaming or tiny devices; some tasks still use LSTM baselines—though LLMs are Transformer-first.
Tip: be ready to draw the unrolled RNN for BPTT; it is a classic whiteboard question.

Quick review checklist

  • Recurrence; BPTT; truncate; vanishing in vanilla RNN.
  • LSTM/GRU; bidirectional; encoder–decoder; teacher forcing.
  • Packing padded batches; Transformer comparison.