
RNN & LSTM: Mastering Sequence Data

Recurrent Neural Networks process sequences by maintaining a hidden state; LSTMs add gating to capture long-term dependencies. A complete mathematical and practical guide, from vanilla RNNs to attention.

Key ideas at a glance:
  • Time steps – unfolding the network through time
  • Vanishing gradients – the problem LSTM solves
  • 3 gates – forget, input, output
  • Bidirectional – context from both past & future

Why Recurrent Networks?

Feedforward networks assume independent inputs. For sequences (time series, text, audio), we need memory. RNNs maintain a hidden state that carries information across time steps.

hₜ = tanh(W·[hₜ₋₁, xₜ] + b)
yₜ = W_y·hₜ + b_y

Parameters are shared across time steps. The same W, b used at every step.
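The recurrence above can be sketched directly in PyTorch; the sizes here (10 input features, 20 hidden units, 5 outputs, 7 time steps) are hypothetical, chosen only to show the shared W and b reused at every step:

```python
import torch

# Minimal sketch of the RNN recurrence with explicitly shared parameters.
W = torch.randn(20, 30) * 0.1   # acts on [h_{t-1}, x_t]  (20 + 10 dims)
b = torch.zeros(20)
W_y = torch.randn(5, 20) * 0.1  # output projection
b_y = torch.zeros(5)

x = torch.randn(7, 10)          # a sequence of 7 steps, 10 features each
h = torch.zeros(20)
outputs = []
for t in range(7):              # the SAME W, b at every time step
    h = torch.tanh(W @ torch.cat([h, x[t]]) + b)
    outputs.append(W_y @ h + b_y)
```

Because the parameters are shared, the model size does not grow with sequence length.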

Vanilla RNN & Backpropagation Through Time

RNN Cell

hₜ = tanh(W_ih·xₜ + b_ih + W_hh·hₜ₋₁ + b_hh)

Hidden state combines current input and previous hidden state.

# PyTorch RNN cell (single step at a time)
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=10, hidden_size=20)

x = torch.randn(5, 1, 10)  # (seq_len, batch, input_size)
h = torch.zeros(1, 20)     # initial hidden state
for t in range(x.size(0)):
    h = rnn_cell(x[t], h)  # same weights reused at every step
Backprop Through Time

Gradients flow backward through time steps. Chain rule multiplies across many tanh derivatives → vanishing/exploding gradients.

Problem: vanilla RNNs struggle to carry information across long sequences (beyond roughly 10–20 steps).

# BPTT conceptually (pseudocode)
for t in reversed(range(seq_len - 1)):
    # the gradient at time t depends on the gradient at t+1,
    # scaled by W_hh and the tanh derivative (1 - h²)
    grad_h[t] += W_hh.T @ (grad_h[t + 1] * (1 - h[t + 1] ** 2))
Truncated BPTT: Limit the number of time steps backpropagated (e.g., 20-50 steps). Common in training language models.
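Truncated BPTT can be sketched with `detach()`: carry the hidden state across chunks, but cut the autograd graph at each chunk boundary. The model, sizes, and truncation length below are hypothetical:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 100, 10)   # (batch, long sequence, features)
y = torch.randn(4, 100, 1)
h = torch.zeros(1, 4, 20)
k = 25                        # truncation length: backprop at most k steps

for start in range(0, 100, k):
    h = h.detach()            # cut the graph at the chunk boundary
    out, h = rnn(x[:, start:start + k], h)
    loss = nn.functional.mse_loss(head(out), y[:, start:start + k])
    opt.zero_grad()
    loss.backward()           # gradients flow only within this chunk
    opt.step()
```

The state still flows forward across the whole sequence; only the gradient is truncated.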

The Vanishing Gradient Problem

Why gradients vanish

During BPTT, ∂hₜ/∂hₖ = ∏ (W_hhᵀ · diag(tanh′)), with tanh′ ≤ 1. Repeated multiplication drives the gradient toward 0, erasing long-term dependencies.

Effect: RNN cannot learn relationships between distant tokens.
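The decay can be observed directly. A small sketch (hypothetical 20-unit state, small random W_hh) measures the gradient that reaches the initial state as the sequence grows:

```python
import torch

torch.manual_seed(0)
W_hh = torch.randn(20, 20) * 0.1  # small recurrent weights

def grad_norm_after(T):
    """Norm of the gradient flowing back through T tanh steps."""
    h0 = torch.zeros(20, requires_grad=True)
    h = h0
    for _ in range(T):
        h = torch.tanh(W_hh @ h + 0.1)
    h.sum().backward()                # product of T Jacobians
    return h0.grad.norm().item()

print(grad_norm_after(5), grad_norm_after(50))  # norm shrinks sharply with depth
```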

Solutions
  • LSTM/GRU – gating preserves gradients
  • ReLU + proper init (helps but not robust)
  • Gradient clipping (for explosion)
  • Residual connections

LSTM – The Gated Solution

LSTM Gates

fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f) // forget gate
iₜ = σ(W_i·[hₜ₋₁, xₜ] + b_i) // input gate
oₜ = σ(W_o·[hₜ₋₁, xₜ] + b_o) // output gate
c̃ₜ = tanh(W_c·[hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ // cell state
hₜ = oₜ ⊙ tanh(cₜ)

Cell state acts as a gradient highway. Forget gate controls what to keep/erase. Gradients flow through addition, not multiplication.

LSTM in 30 seconds
  • Forget – reset cell state
  • Input – write new info
  • Output – expose cell state
  • Cell – long-term memory
  • Hidden – short-term / output
PyTorch LSTM – from nn.LSTM to custom cell
import torch
import torch.nn as nn

# Built-in LSTM
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 100)   # (batch, seq, feature)
output, (h_n, c_n) = lstm(x)  # output: (8, 50, 512) – hidden×2 directions

# Manual LSTM cell (for understanding)
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fc = nn.Linear(input_size + hidden_size, hidden_size * 4)
        
    def forward(self, x, h, c):
        gates = self.fc(torch.cat([x, h], dim=1))
        f, i, o, c_tilde = gates.chunk(4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c = f * c + i * torch.tanh(c_tilde)
        h = o * torch.tanh(c)
        return h, c
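Driving a cell like this by hand means unrolling it over time yourself. A usage sketch with the built-in `nn.LSTMCell` (same step-wise interface; sizes hypothetical):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=10, hidden_size=20)
x = torch.randn(7, 3, 10)         # (seq_len, batch, features)
h = torch.zeros(3, 20)            # short-term state / output
c = torch.zeros(3, 20)            # long-term cell memory
for t in range(7):
    h, c = cell(x[t], (h, c))     # one gated step per time step
```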

GRU – LSTM's Leaner Cousin

GRU Gates (only two)

zₜ = σ(W_z·[hₜ₋₁, xₜ]) // update gate
rₜ = σ(W_r·[hₜ₋₁, xₜ]) // reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])
hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ

Combines forget and input gates. Fewer parameters, often similar performance.

LSTM vs GRU
Gates & state   LSTM: 3 gates, cell state + hidden state · GRU: 2 gates, hidden state only
Parameters      per unit, LSTM ≈ 4× a vanilla RNN, GRU ≈ 3×
When GRU?       Smaller dataset, faster training
# PyTorch GRU
gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
output, h_n = gru(x)

# TensorFlow/Keras
tf.keras.layers.GRU(units=128, return_sequences=True)

Stacked & Bidirectional RNNs

Stacked (Deep) RNNs

The hidden-state sequence of each layer becomes the input to the layer above. Captures hierarchical features.

lstm = nn.LSTM(input_size, hidden_size, num_layers=3, dropout=0.3)

Dropout between layers (except last).

Bidirectional RNNs

Two independent RNNs: left-to-right and right-to-left. Concatenate outputs. Context from both sides.

lstm = nn.LSTM(input_size, hidden_size, bidirectional=True)
# output shape: (batch, seq, hidden*2)

NLP essential: BERT-style models likewise rely on bidirectional context.
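The two directions are concatenated along the feature axis, so the output can be split back apart. A sketch with hypothetical sizes:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20,
               batch_first=True, bidirectional=True)
x = torch.randn(3, 7, 10)                  # (batch, seq, features)
output, (h_n, c_n) = lstm(x)               # output: (3, 7, 40) = hidden×2
fwd, bwd = output[..., :20], output[..., 20:]  # left-to-right / right-to-left
```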

Encoder-Decoder & Attention

Sequence-to-Sequence

Encoder compresses input sequence to context vector (final hidden). Decoder generates output from context.

Problem: Fixed context bottleneck for long sequences.

Attention Mechanism

Decoder looks at all encoder hidden states. Context = weighted sum of encoder outputs.

eᵢⱼ = score(h_decᵢ, h_encⱼ)
αᵢⱼ = softmax(eᵢⱼ)
cᵢ = ∑ αᵢⱼ h_encⱼ

Attention scores: dot product, additive (Bahdanau), or multiplicative (Luong).

Attention is all you need: Transformers replace RNNs with self-attention. But RNN+attention still used in speech, streaming models.
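The three attention equations above map directly to code. A dot-product-score sketch for a single decoder step (sizes hypothetical):

```python
import torch
import torch.nn.functional as F

h_enc = torch.randn(7, 20)   # 7 encoder hidden states, 20 dims each
h_dec = torch.randn(20)      # one decoder hidden state

e = h_enc @ h_dec            # scores e_j = dot(h_dec, h_enc_j)
alpha = F.softmax(e, dim=0)  # attention weights α_j, sum to 1
c = alpha @ h_enc            # context = weighted sum of encoder states
```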

RNN/LSTM in PyTorch & TensorFlow

PyTorch – Sentiment LSTM
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, 
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        embedded = self.embedding(x)
        output, (h_n, c_n) = self.lstm(embedded)
        # Concatenate final forward and backward hidden
        h_n = torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim=1)
        h_n = self.dropout(h_n)
        return self.fc(h_n)
TensorFlow/Keras LSTM
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Real-World Applications

NLP

Language modeling, NER, translation

Speech

ASR, synthesis, keyword spotting

Time Series

Stock, weather, anomaly detection

Bioinformatics

Protein sequence, gene expression

Model Comparison Table

Model          Gates  State  Long-range  Parameters  When to use
Vanilla RNN    0      h      Poor        Low         Short sequences, debugging
LSTM           3      h, c   ✅✅         High        Default for complex sequences
GRU            2      h      Good        Medium      Small data, faster training
Bidirectional  –      –      –           –           NLP, complete context available
Stacked        –      –      –           Depth×      Hierarchical features

Training RNNs/LSTMs – Best Practices

✅ Gradient clipping: Essential. Clip norm to 1.0 or 5.0.
✅ Initialize forget gate bias to 1 – helps remember at start.
✅ Layer normalization – stabilizes LSTM training.
⚠️ Don't use RNN for very long sequences (>500) – use Transformer or CNN.
⚠️ Watch for overfitting – LSTMs have many parameters. Use dropout (nn.LSTM's dropout argument applies between stacked layers).

Pro tip: For time series, try batch_first=True and pack_padded_sequence for variable-length sequences.
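The packing workflow from the tip above, sketched end to end (sizes and lengths hypothetical): pad to a common length, pack so the LSTM skips the padding, then unpack.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(3, 7, 10)          # padded batch, max length 7
lengths = torch.tensor([7, 5, 2])  # true lengths, sorted descending
                                   # (or pass enforce_sorted=False)
packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, (h_n, c_n) = lstm(packed)   # padding steps are never computed
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
```

`h_n` then holds the state at each sequence's true last step, not at the padded end.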

RNN/LSTM Cheatsheet

  • RNN – hₜ = tanh(W·[hₜ₋₁, xₜ] + b)
  • LSTM – 3 gates + cell state
  • GRU – update + reset gates
  • BPTT – unroll, then backprop
  • Bidirectional – past + future context
  • Packing – variable-length sequences
  • Attention – weighted sum of encoder states
  • Seq2Seq – encoder–decoder
Next Up: Transformers & Attention – Beyond RNNs.