
Recurrent Neural Networks — 15 Interview Questions

Hidden state over time, many-to-one vs seq2seq, BPTT truncation, and why LSTM/GRU replaced vanilla RNNs for long dependencies.


1 What is a recurrent neural network? (Easy)
Answer: A model with a hidden state h_t updated each step: h_t = f(h_{t−1}, x_t)—same weights applied across time, suited to sequences.
2 Vanilla RNN update (simple form). (Easy)
Answer: Often h_t = tanh(W_h h_{t−1} + W_x x_t + b): an affine map of the previous hidden state and the current input, passed through a tanh nonlinearity.
h_t = tanh(W_h h_{t−1} + W_x x_t + b)
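A minimal PyTorch sketch of this update; the sizes, weight names, and random initialization are illustrative assumptions, not part of the card.

  import torch

  B, D_in, D_h = 4, 8, 16                     # illustrative batch, input, hidden sizes
  W_x = torch.randn(D_h, D_in) * 0.1          # input-to-hidden weights
  W_h = torch.randn(D_h, D_h) * 0.1           # hidden-to-hidden weights
  b = torch.zeros(D_h)

  def rnn_step(h_prev, x_t):
      # h_t = tanh(W_h h_{t-1} + W_x x_t + b), applied to each batch row
      return torch.tanh(h_prev @ W_h.T + x_t @ W_x.T + b)

  h = torch.zeros(B, D_h)
  for x_t in torch.randn(10, B, D_in):        # 10 steps, same weights reused each step
      h = rnn_step(h, x_t)
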
3 What is backpropagation through time (BPTT)? (Medium)
Answer: Unroll the network over T steps into a DAG, run backprop—gradient flows through every time link. Memory and compute grow with T.
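Hedged sketch of full BPTT using nn.RNNCell: unroll over T steps, score the final state, and let autograd push gradients back through every time link (the squared-error loss and sizes are illustrative).

  import torch
  import torch.nn as nn

  T, B, D_in, D_h = 50, 4, 8, 16
  cell = nn.RNNCell(D_in, D_h)                # the same weights are reused at every step
  xs = torch.randn(T, B, D_in)
  target = torch.randn(B, D_h)

  h = torch.zeros(B, D_h)
  for t in range(T):                          # unrolling: the autograd graph grows with T
      h = cell(xs[t], h)
  loss = ((h - target) ** 2).mean()
  loss.backward()                             # gradient flows back through all T links
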
4 Truncated BPTT: why? (Medium)
Answer: Limit backprop depth in time to a window—cheaper and stabilizes training; trades off long-range credit assignment.
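One common way to truncate, sketched under assumptions (window k, SGD, squared-error loss): detach the hidden state at each window boundary so backprop never reaches past the current window.

  import torch
  import torch.nn as nn

  T, k, B, D_in, D_h = 200, 20, 4, 8, 16
  cell = nn.RNNCell(D_in, D_h)
  opt = torch.optim.SGD(cell.parameters(), lr=0.01)
  xs = torch.randn(T, B, D_in)
  targets = torch.randn(T, B, D_h)

  h = torch.zeros(B, D_h)
  for start in range(0, T, k):
      h = h.detach()                          # cut the graph: no credit assignment beyond this window
      opt.zero_grad()
      loss = 0.0
      for t in range(start, min(start + k, T)):
          h = cell(xs[t], h)
          loss = loss + ((h - targets[t]) ** 2).mean()
      loss.backward()                         # backprop through at most k steps
      opt.step()
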
5 Why do vanilla RNNs struggle with long sequences? (Medium)
Answer: Repeated Jacobian products over steps cause vanishing or exploding gradients—hard to learn long-range dependencies.
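A toy numeric illustration, assuming a linear recurrence h_t = W h_{t−1} so the gradient w.r.t. h_0 is just a T-fold product of the same matrix; the 0.9 and 1.1 scales are arbitrary.

  import torch

  T, D = 50, 16
  for scale in (0.9, 1.1):                    # contracting vs expanding recurrent matrix
      W = scale * torch.eye(D)
      g = torch.ones(D)                       # stand-in for dLoss/dh_T
      for _ in range(T):
          g = W.T @ g                         # one Jacobian factor per time step
      print(scale, g.norm().item())           # ~0.9^50 shrinks toward 0; ~1.1^50 blows up
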
6 LSTM gates: names and roles. (Medium)
Answer: Forget (what to erase from cell), input (what to write), output (what to expose from cell). Cell state carries information additively—better gradient paths.
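One LSTM step written out gate by gate; the stacked-weight layout, names, and sizes are assumptions for the sketch, but the gate equations are the standard ones.

  import torch

  def lstm_step(x_t, h_prev, c_prev, W, U, b):
      # W: (4H, D_in), U: (4H, H), b: (4H,), stacked as [input, forget, cell, output]
      H = h_prev.shape[-1]
      z = x_t @ W.T + h_prev @ U.T + b
      i = torch.sigmoid(z[..., 0*H:1*H])      # input gate: what to write to the cell
      f = torch.sigmoid(z[..., 1*H:2*H])      # forget gate: what to erase from the cell
      g = torch.tanh(z[..., 2*H:3*H])         # candidate cell content
      o = torch.sigmoid(z[..., 3*H:4*H])      # output gate: what to expose as h_t
      c_t = f * c_prev + i * g                # additive cell update: the friendlier gradient path
      h_t = o * torch.tanh(c_t)
      return h_t, c_t

  B, D_in, H = 4, 8, 16
  W, U, b = torch.randn(4*H, D_in), torch.randn(4*H, H), torch.zeros(4*H)
  h, c = lstm_step(torch.randn(B, D_in), torch.zeros(B, H), torch.zeros(B, H), W, U, b)
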
7 GRU vs LSTM: interview contrast. (Easy)
Answer: GRU merges the forget and input gates into a single update gate and drops the separate cell state, so it has fewer parameters; quality is often similar with less compute, while LSTM remains the more common historical baseline.
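A quick parameter-count contrast with the built-in modules (sizes illustrative): GRU has three gate blocks to LSTM's four, so roughly three quarters of the recurrent parameters.

  import torch.nn as nn

  D_in, D_h = 128, 256
  lstm = nn.LSTM(D_in, D_h)
  gru = nn.GRU(D_in, D_h)
  count = lambda m: sum(p.numel() for p in m.parameters())
  print(count(lstm), count(gru))              # LSTM count is roughly 4/3 of GRU's
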
8 Bidirectional RNN. (Easy)
Answer: Two RNNs: one forward, one backward; concatenate hidden states—uses future context; good for tagging/NLP, not for causal online prediction.
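Shape sketch with a bidirectional nn.LSTM (sizes illustrative): per-step outputs concatenate the forward and backward states, so the feature size doubles.

  import torch
  import torch.nn as nn

  T, B, D_in, D_h = 12, 4, 8, 16
  birnn = nn.LSTM(D_in, D_h, bidirectional=True)
  out, (h_n, c_n) = birnn(torch.randn(T, B, D_in))
  print(out.shape)                            # (T, B, 2*D_h): both directions concatenated per step
  print(h_n.shape)                            # (2, B, D_h): final state of each direction
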
9 Encoder–decoder (seq2seq) idea. (Medium)
Answer: Encoder RNN compresses input sequence to context vector; decoder RNN generates output sequence—basis of early NMT before attention dominated.
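A minimal encoder-decoder sketch without attention; the vocabulary sizes, GRU choice, greedy loop, and assumed <sos> token id 0 are all illustrative assumptions.

  import torch
  import torch.nn as nn

  V_src, V_tgt, E, H = 1000, 1200, 64, 128
  enc_emb, dec_emb = nn.Embedding(V_src, E), nn.Embedding(V_tgt, E)
  encoder, decoder = nn.GRU(E, H), nn.GRU(E, H)
  out_proj = nn.Linear(H, V_tgt)

  src = torch.randint(0, V_src, (15, 1))      # (T_src, batch=1) source token ids
  _, context = encoder(enc_emb(src))          # context: (1, 1, H) summary of the whole input

  tok = torch.zeros(1, 1, dtype=torch.long)   # assumed <sos> start token (id 0)
  h = context                                 # decoder starts from the encoder's summary
  for _ in range(20):                         # greedy decoding, fixed max length
      out, h = decoder(dec_emb(tok), h)
      tok = out_proj(out).argmax(-1)          # feed the predicted token back in
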
10 Teacher forcing. (Medium)
Answer: During training, the decoder receives the ground-truth previous token as input instead of its own prediction; this speeds and stabilizes convergence, and the resulting train/test mismatch (exposure bias) is addressed with scheduled sampling and related techniques.
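Hedged training-loop sketch of teacher forcing with a GRUCell decoder (sizes and loss are illustrative); the final comment marks where scheduled sampling would swap in the model's own token.

  import torch
  import torch.nn as nn

  V, E, H, B = 1200, 64, 128, 4
  emb, dec, proj = nn.Embedding(V, E), nn.GRUCell(E, H), nn.Linear(H, V)
  loss_fn = nn.CrossEntropyLoss()

  tgt = torch.randint(0, V, (10, B))          # (T, B) gold output tokens
  h = torch.zeros(B, H)                       # e.g. the encoder context
  loss = 0.0
  for t in range(1, tgt.shape[0]):
      inp = tgt[t - 1]                        # teacher forcing: feed the gold previous token
      h = dec(emb(inp), h)
      loss = loss + loss_fn(proj(h), tgt[t])  # predict gold token t
      # scheduled sampling would sometimes use proj(h).argmax(-1) as the next inp instead
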
11 Padding and pack_padded_sequence: why? (Hard)
Answer: Batched variable-length sequences are padded; pack avoids wasted compute on pad tokens and keeps hidden state meaningful in frameworks like PyTorch.
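A PyTorch packing sketch with a padded batch of lengths 5, 3, 2 (other sizes illustrative).

  import torch
  import torch.nn as nn
  from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

  lengths = torch.tensor([5, 3, 2])           # true lengths, sorted descending here
  B, T, D_in, D_h = 3, 5, 8, 16
  x = torch.randn(B, T, D_in)                 # padded batch, batch_first layout
  rnn = nn.LSTM(D_in, D_h, batch_first=True)

  packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=True)
  packed_out, (h_n, c_n) = rnn(packed)        # pad positions are skipped entirely
  out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
  print(out.shape, h_n.shape)                 # (3, 5, 16) and (1, 3, 16); h_n comes from real tokens
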
12 Many-to-one vs many-to-many examples. (Easy)
Answer: Many-to-one: sentiment from a sentence. Many-to-many: POS tagging per token; seq2seq: translation.
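Shape sketch of both patterns on top of one recurrent encoder (sizes and the linear heads are illustrative).

  import torch
  import torch.nn as nn

  T, B, D_in, D_h, C = 20, 4, 8, 16, 5
  rnn = nn.GRU(D_in, D_h)
  out, h_n = rnn(torch.randn(T, B, D_in))

  many_to_one = nn.Linear(D_h, C)(h_n[-1])    # (B, C): e.g. one sentiment label per sequence
  many_to_many = nn.Linear(D_h, C)(out)       # (T, B, C): e.g. one POS tag per token
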
13 When do Transformers replace RNNs? (Medium)
Answer: When you have the data and compute for self-attention: it processes all positions in parallel and connects any two positions in a single hop, while an RNN must step through the sequence, which is slow on GPUs for long sequences.
14 1D CNN for sequences vs RNN. (Medium)
Answer: Stacked 1D convolutions capture local n-grams and grow context with depth; they are fast and parallel. RNNs or attention are usually better for very long or flexible dependencies, depending on the design.
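A small Conv1d sketch (sizes illustrative): two kernel-size-3 layers give each output position a receptive field of 5 input steps, computed in parallel across the whole sequence.

  import torch
  import torch.nn as nn

  B, D_in, T, D_h = 4, 8, 50, 16
  net = nn.Sequential(
      nn.Conv1d(D_in, D_h, kernel_size=3, padding=1), nn.ReLU(),
      nn.Conv1d(D_h, D_h, kernel_size=3, padding=1), nn.ReLU(),   # receptive field now 5
  )
  y = net(torch.randn(B, D_in, T))            # (B, D_h, T): length preserved by padding
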
15 State one advantage of the RNN family today. (Easy)
Answer: Small memory per step for streaming or tiny devices; some tasks still use LSTM baselines—though LLMs are Transformer-first.
Tip: be ready to draw the unrolled RNN for BPTT; it is a classic whiteboard question.

Quick review checklist

  • Recurrence; BPTT; truncate; vanishing in vanilla RNN.
  • LSTM/GRU; bidirectional; encoder–decoder; teacher forcing.
  • Packing padded batches; Transformer comparison.