RNN & LSTM: Mastering Sequence Data
Recurrent Neural Networks process sequences by maintaining a hidden state; LSTMs introduce gating to capture long-term dependencies. From vanilla RNNs to attention: a complete mathematical and practical guide.
- Time Steps – the network unfolds across the sequence
- Vanishing Gradients – the core RNN weakness; LSTM is the solution
- 3 Gates – forget, input, output
- Bidirectional – context from past & future
Why Recurrent Networks?
Feedforward networks assume independent inputs. For sequences (time series, text, audio), we need memory. RNNs maintain a hidden state that carries information across time steps.
Parameters are shared across time steps: the same weights W and biases b are used at every step.
Vanilla RNN & Backpropagation Through Time
RNN Cell
hₜ = tanh(W_ih·xₜ + b_ih + W_hh·hₜ₋₁ + b_hh)
Hidden state combines current input and previous hidden state.
```python
# PyTorch RNN cell (one step at a time)
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=10, hidden_size=20)
x = torch.randn(5, 1, 10)   # (seq_len, batch, input_size)
h = torch.zeros(1, 20)      # initial hidden state
for t in range(x.size(0)):
    h = rnn_cell(x[t], h)
```
Backprop Through Time
Gradients flow backward through time steps. Chain rule multiplies across many tanh derivatives → vanishing/exploding gradients.
Problem: vanilla RNNs struggle to learn dependencies spanning more than roughly 10–20 steps.
```python
# BPTT conceptually (pseudocode)
for t in reversed(range(seq_len - 1)):
    # the gradient at time t depends on the gradient at t+1
    grad_h[t] += W_hh.T @ ((1 - h[t + 1] ** 2) * grad_h[t + 1])  # (1 - h²) = tanh'
```
The Vanishing Gradient Problem
Why gradients vanish
During BPTT, the gradient contains the product ∏ₖ (W_hhᵀ · diag(tanh′)), one factor per time step. Since tanh′ ≤ 1, repeated multiplication drives the gradient → 0 across long-term dependencies.
Effect: RNN cannot learn relationships between distant tokens.
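The effect is easy to observe empirically. A minimal sketch (the sizes and the 0.5 weight scaling are illustrative choices, not from the text): shrink the recurrent weights, backpropagate from the last time step only, and compare the gradient that reaches an early input with the gradient that reaches a late one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=8)
with torch.no_grad():
    rnn.weight_hh_l0 *= 0.5                      # shrink recurrent weights: gradients will vanish

x = torch.randn(100, 1, 8, requires_grad=True)   # (seq_len, batch, feature)
out, _ = rnn(x)
out[-1].sum().backward()                         # loss depends only on the final step

early = x.grad[0].norm().item()                  # gradient reaching t=0
late = x.grad[90].norm().item()                  # gradient reaching t=90
print(f"grad norm at t=0: {early:.2e}, at t=90: {late:.2e}")
```

The gradient arriving at t=0 is many orders of magnitude smaller than the one arriving at t=90, so the earliest inputs effectively stop contributing to learning.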
Solutions
- LSTM/GRU – gating preserves gradients
- ReLU + proper init (helps but not robust)
- Gradient clipping (for explosion)
- Residual connections
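Gradient clipping, the standard fix for explosion, is one line in PyTorch. A sketch of a training step (the toy model and loss are illustrative, not from the text):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(30, 4, 10)       # (seq_len, batch, feature)
out, _ = model(x)
loss = out.pow(2).mean()         # toy loss for illustration

opt.zero_grad()
loss.backward()
# rescale gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Clipping caps the total gradient norm without changing its direction, which is why it helps with explosion but not with vanishing.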
LSTM – The Gated Solution
LSTM Gates
fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f) // forget gate
iₜ = σ(W_i·[hₜ₋₁, xₜ] + b_i) // input gate
oₜ = σ(W_o·[hₜ₋₁, xₜ] + b_o) // output gate
c̃ₜ = tanh(W_c·[hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ // cell state
hₜ = oₜ ⊙ tanh(cₜ)
Cell state acts as a gradient highway. Forget gate controls what to keep/erase. Gradients flow through addition, not multiplication.
LSTM in 30 seconds
- Forget – reset cell state
- Input – write new info
- Output – expose cell state
- Cell – long-term memory
- Hidden – short-term / output
```python
import torch
import torch.nn as nn

# Built-in LSTM
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(32, 50, 100)     # (batch, seq, feature)
output, (h_n, c_n) = lstm(x)
```
```python
# Manual LSTM cell (for understanding)
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # one linear layer computes all four gate pre-activations at once
        self.fc = nn.Linear(input_size + hidden_size, hidden_size * 4)

    def forward(self, x, h, c):
        gates = self.fc(torch.cat([x, h], dim=1))
        f, i, o, c_tilde = gates.chunk(4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c = f * c + i * torch.tanh(c_tilde)
        h = o * torch.tanh(c)
        return h, c
```
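To drive a single cell over a sequence, PyTorch's built-in nn.LSTMCell exposes the same step-by-step interface; a quick sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=10, hidden_size=20)
x = torch.randn(5, 3, 10)        # (seq_len, batch, input_size)
h = torch.zeros(3, 20)
c = torch.zeros(3, 20)
for t in range(x.size(0)):
    h, c = cell(x[t], (h, c))    # one time step
print(h.shape)                   # torch.Size([3, 20])
```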
GRU – LSTM's Leaner Cousin
GRU Gates (only two)
zₜ = σ(W_z·[hₜ₋₁, xₜ]) // update gate
rₜ = σ(W_r·[hₜ₋₁, xₜ]) // reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])
hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
Combines forget and input gates. Fewer parameters, often similar performance.
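The two-gate structure translates almost line-for-line into code. A minimal sketch of the equations above (the layer names zr and hn are my own, not standard API names):

```python
import torch
import torch.nn as nn

class GRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.zr = nn.Linear(input_size + hidden_size, hidden_size * 2)  # update + reset gates
        self.hn = nn.Linear(input_size + hidden_size, hidden_size)      # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.hn(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde                                # hₜ = (1-z)⊙hₜ₋₁ + z⊙h̃ₜ

cell = GRUCell(10, 20)
h = cell(torch.randn(3, 10), torch.zeros(3, 20))   # h shape: (3, 20)
```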
LSTM vs GRU
| | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| State | cell state + hidden state | hidden state only |
| Parameter sets | ≈ 4× | ≈ 3× |
| When to prefer | default for complex sequences | smaller datasets, faster training |
```python
# PyTorch GRU
gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
output, h_n = gru(x)
```

```python
# TensorFlow/Keras
tf.keras.layers.GRU(units=128, return_sequences=True)
```
Stacked & Bidirectional RNNs
Stacked (Deep) RNNs
The hidden-state sequence of each layer becomes the input to the layer above. Captures hierarchical features.
```python
lstm = nn.LSTM(input_size, hidden_size, num_layers=3, dropout=0.3)
```
Dropout is applied between layers (except after the last).
Bidirectional RNNs
Two independent RNNs: left-to-right and right-to-left. Concatenate outputs. Context from both sides.
```python
lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
output, (h_n, c_n) = lstm(x)   # output shape: (batch, seq, hidden*2)
```
Essential in NLP: BERT builds on bidirectional context.
Encoder-Decoder & Attention
Sequence-to-Sequence
The encoder compresses the input sequence into a context vector (its final hidden state); the decoder generates the output sequence from that context.
Problem: Fixed context bottleneck for long sequences.
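The bottleneck is visible in code: no matter how long the source is, everything the decoder sees passes through one fixed-size state. A sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

enc = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
dec = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

src = torch.randn(1, 40, 16)           # long source sequence
_, (h_n, c_n) = enc(src)               # entire input squeezed into h_n, c_n
tgt_in = torch.randn(1, 10, 16)
dec_out, _ = dec(tgt_in, (h_n, c_n))   # decoder sees only the fixed context
print(h_n.shape)                       # torch.Size([1, 1, 32])
```

A 40-step input and a 400-step input both collapse into the same (1, 1, 32) context, which is exactly what attention relaxes.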
Attention Mechanism
Decoder looks at all encoder hidden states. Context = weighted sum of encoder outputs.
eᵢⱼ = score(h_decᵢ, h_encⱼ)
αᵢⱼ = softmaxⱼ(eᵢⱼ)
cᵢ = ∑ⱼ αᵢⱼ · h_encⱼ
Attention scores: dot product, additive (Bahdanau), or multiplicative (Luong).
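The weighted sum can be sketched with dot-product scoring (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

h_enc = torch.randn(1, 7, 256)   # encoder states: (batch, src_len, dim)
h_dec = torch.randn(1, 256)      # current decoder state

e = torch.bmm(h_enc, h_dec.unsqueeze(2)).squeeze(2)        # dot-product scores: (1, 7)
alpha = F.softmax(e, dim=1)                                # attention weights sum to 1
context = torch.bmm(alpha.unsqueeze(1), h_enc).squeeze(1)  # weighted sum: (1, 256)
```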
RNN/LSTM in PyTorch & TensorFlow
```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        embedded = self.embedding(x)
        output, (h_n, c_n) = self.lstm(embedded)
        # concatenate the final forward and backward hidden states
        h_n = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), dim=1)
        h_n = self.dropout(h_n)
        return self.fc(h_n)
```
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128, input_length=100),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```
Real-World Applications
- NLP – language modeling, NER, translation
- Speech – ASR, synthesis, keyword spotting
- Time Series – stock prices, weather, anomaly detection
- Bioinformatics – protein sequences, gene expression
Model Comparison Table
| Model | Gates | State | Long-range | Parameters | When to use |
|---|---|---|---|---|---|
| Vanilla RNN | 0 | h | ❌ | Low | Short sequences, debugging |
| LSTM | 3 | h, c | ✅✅ | High | Default for complex sequences |
| GRU | 2 | h | ✅ | Medium | Small data, faster training |
| Bidirectional | - | - | ✅ | 2× | NLP, complete context available |
| Stacked | - | - | ✅ | Depth× | Hierarchical features |
Training RNNs/LSTMs – Best Practices
Pro tip: use batch_first=True for more intuitive (batch, seq, feature) shapes, and pack_padded_sequence to handle variable-length sequences efficiently.
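Packing variable-length sequences can be sketched as follows (sizes and lengths are illustrative; with enforce_sorted=True, lengths must be in descending order):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(3, 5, 8)             # batch of 3, padded to length 5
lengths = torch.tensor([5, 3, 2])    # true length of each sequence

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=True)
packed_out, (h_n, c_n) = lstm(packed)         # padding steps are skipped entirely
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                     # torch.Size([3, 5, 16]); padded positions are zeros
```

Packing ensures h_n holds the state at each sequence's true last step, not at a padding step.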