Neural Networks for Sequences
LSTM and GRU

Recurrent Neural Networks

Many inputs are sequences: words in a sentence, audio samples, stock prices over days. A vanilla RNN maintains a hidden state h_t updated at each time step from the previous state and current input. The same weights are reused across time (weight tying), which is parameter-efficient but makes backpropagation through time (BPTT) multiply Jacobians along the sequence, giving the classic vanishing/exploding behavior over long horizons. LSTM and GRU introduce gates to control what to forget, store, and output, enabling better long-range dependencies. For many NLP tasks today, Transformers (attention) have largely superseded RNNs, but RNNs remain useful for streaming data, small models, and pedagogy.


Vanilla RNN

Schematically: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b). The output at each step can feed a loss (many-to-many), or only the final hidden state can be used to classify the whole sequence (many-to-one). Bidirectional RNNs run one RNN forward and one backward and concatenate the two states, so each position sees both past and future context; this is common in tagging but not usable for causal autoregressive generation without masking tricks.
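A minimal sketch of this update, assuming nothing beyond torch itself (the rnn_step helper and the toy sizes are illustrative, not a library API):

import torch

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # one vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
    return torch.tanh(h_prev @ W_hh + x_t @ W_xh + b)

# toy sizes: input_size=4, hidden_size=3, sequence length 5
W_xh, W_hh, b = torch.randn(4, 3), torch.randn(3, 3), torch.zeros(3)
h = torch.zeros(3)
for x_t in torch.randn(5, 4):
    h = rnn_step(x_t, h, W_xh, W_hh, b)  # same weights reused at every step (weight tying)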

Training truncates BPTT to a fixed window to limit memory; very long dependencies still challenge plain RNNs.
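One common way to implement the truncation, sketched below with an illustrative window length, toy model, and random data, is to detach the hidden state between windows so no gradient flows past a window boundary:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 200, 8)   # (batch, long seq_len, input_size)
y = torch.randn(4, 200, 1)
window = 50                  # truncated-BPTT window
h = None
for start in range(0, x.size(1), window):
    out, h = rnn(x[:, start:start + window], h)
    loss = nn.functional.mse_loss(head(out), y[:, start:start + window])
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()           # cut the graph: gradients stop at the window boundary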

LSTM and GRU

LSTM adds a cell state c_t and three gates: forget, input, and output. The cell updates additively, giving gradients a “highway” that reduces vanishing compared to repeated tanh squashing alone. GRU folds the same ideas into two gates (reset and update) and often matches LSTM quality with fewer parameters. Both are near drop-in replacements for nn.RNN in PyTorch, with the caveat that the LSTM also returns a cell state.
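A rough illustration of the near drop-in claim, with arbitrary sizes: nn.GRU and nn.LSTM take the same constructor arguments as nn.RNN and are called the same way, and the parameter counts reflect the GRU's smaller gate set.

import torch
import torch.nn as nn

x = torch.randn(32, 50, 128)   # (batch, seq_len, input_size)

gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
out, h_n = gru(x)              # GRU returns outputs and a hidden state

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
out, (h_n, c_n) = lstm(x)      # LSTM additionally returns a cell state

# three gate blocks (GRU) vs. four (LSTM): fewer parameters at the same hidden size
print(sum(p.numel() for p in gru.parameters()), sum(p.numel() for p in lstm.parameters()))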

For new sequence projects, try a Transformer or 1D CNN + attention first if data and compute allow; fall back to LSTM for tight latency or tiny footprints.

PyTorch: nn.LSTM

Batch-first LSTM
import torch
import torch.nn as nn

# x: (batch, seq_len, input_size)
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
x = torch.randn(32, 50, 128)
out, (h_n, c_n) = lstm(x)
# out[:, -1, :] — last timestep; or use out for per-step heads
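Continuing the block above, a hypothetical many-to-one head (the class count and nn.Linear head are illustrative, not part of the original example):

num_classes = 10                 # illustrative
head = nn.Linear(256, num_classes)
logits = head(out[:, -1, :])     # (batch, num_classes) from the last timestep
# for this unidirectional, batch_first LSTM, h_n[-1] equals out[:, -1, :]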

Summary

  • RNNs map sequences via a recurrent hidden state and shared weights across time.
  • BPTT multiplies Jacobians across time, causing vanishing/exploding gradients; LSTM/GRU gates mitigate vanishing over longer spans.
  • Bidirectional RNNs use future context; unidirectional suits online/decoding settings.
  • Next: Attention—soft, content-based aggregation that powers Transformers.

Attention lets every position look at every other position (with weights)—the bridge from RNNs to modern language and vision models.