Neural Networks for Sequences
LSTM and GRU

Recurrent Neural Networks

Many inputs are sequences: words in a sentence, audio samples, stock prices over days. A vanilla RNN maintains a hidden state h_t updated at each time step from the previous state and current input. The same weights are reused across time (weight tying), which is parameter-efficient but makes backpropagation through time (BPTT) multiply Jacobians along the sequence, giving the classic vanishing/exploding behavior over long horizons. LSTM and GRU introduce gates to control what to forget, store, and output, enabling better long-range dependencies. For many NLP tasks today, Transformers (attention) have largely superseded RNNs, but RNNs remain useful for streaming data, small models, and pedagogy.


Vanilla RNN

Schematically: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b). The output at each step can feed a loss (many-to-many), or only the final hidden state can be used to classify the whole sequence (many-to-one). Bidirectional RNNs run one RNN forward and one backward and concatenate the two states, so each position sees both past and future context; this is common in tagging but not usable for causal autoregressive generation without masking tricks.
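A minimal sketch of this update, assuming nothing beyond torch itself (the rnn_step helper and the toy sizes are illustrative, not a library API):

import torch

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # one vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
    return torch.tanh(h_prev @ W_hh + x_t @ W_xh + b)

# toy sizes: input_size=4, hidden_size=3, sequence length 5
W_xh, W_hh, b = torch.randn(4, 3), torch.randn(3, 3), torch.zeros(3)
h = torch.zeros(3)
for x_t in torch.randn(5, 4):
    h = rnn_step(x_t, h, W_xh, W_hh, b)  # same weights reused at every step (weight tying)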

Training truncates BPTT to a fixed window to limit memory; very long dependencies still challenge plain RNNs.
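One common way to implement the truncation, sketched below with an illustrative window length, toy model, and random data, is to detach the hidden state between windows so no gradient flows past a window boundary:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 200, 8)   # (batch, long seq_len, input_size)
y = torch.randn(4, 200, 1)
window = 50                  # truncated-BPTT window
h = None
for start in range(0, x.size(1), window):
    out, h = rnn(x[:, start:start + window], h)
    loss = nn.functional.mse_loss(head(out), y[:, start:start + window])
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()           # cut the graph: gradients stop at the window boundary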

LSTM and GRU

LSTM adds a cell state c_t and three gates: forget, input, and output. The cell updates additively, giving gradients a “highway” that reduces vanishing compared to repeated tanh squashing alone. GRU folds the same ideas into two gates (reset and update) and often matches LSTM quality with fewer parameters. Both are near drop-in replacements for nn.RNN in PyTorch, with the caveat that the LSTM also returns a cell state.
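A rough illustration of the near drop-in claim, with arbitrary sizes: nn.GRU and nn.LSTM take the same constructor arguments as nn.RNN and are called the same way, and the parameter counts reflect the GRU's smaller gate set.

import torch
import torch.nn as nn

x = torch.randn(32, 50, 128)   # (batch, seq_len, input_size)

gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
out, h_n = gru(x)              # GRU returns outputs and a hidden state

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
out, (h_n, c_n) = lstm(x)      # LSTM additionally returns a cell state

# three gate blocks (GRU) vs. four (LSTM): fewer parameters at the same hidden size
print(sum(p.numel() for p in gru.parameters()), sum(p.numel() for p in lstm.parameters()))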

For new sequence projects, try a Transformer or 1D CNN + attention first if data and compute allow; fall back to LSTM for tight latency or tiny footprints.

PyTorch: nn.LSTM

Batch-first LSTM
import torch
import torch.nn as nn

# x: (batch, seq_len, input_size)
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
x = torch.randn(32, 50, 128)
out, (h_n, c_n) = lstm(x)
# out[:, -1, :] — last timestep; or use out for per-step heads
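Continuing the block above, a hypothetical many-to-one head (the class count and nn.Linear head are illustrative, not part of the original example):

num_classes = 10                 # illustrative
head = nn.Linear(256, num_classes)
logits = head(out[:, -1, :])     # (batch, num_classes) from the last timestep
# for this unidirectional, batch_first LSTM, h_n[-1] equals out[:, -1, :]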

Summary

  • RNNs map sequences via a recurrent hidden state and shared weights across time.
  • BPTT multiplies Jacobians across time, causing vanishing/exploding gradients; LSTM/GRU gates mitigate vanishing over longer spans.
  • Bidirectional RNNs use future context; unidirectional suits online/decoding settings.
  • Next: Attention—soft, content-based aggregation that powers Transformers.

Attention lets every position look at every other position (with weights)—the bridge from RNNs to modern language and vision models.