
RNN & LSTM: Mastering Sequence Data

Recurrent Neural Networks process sequences by maintaining a hidden state; LSTMs add gating to capture long-term dependencies. A complete mathematical and practical guide, from vanilla RNNs to attention.

Key ideas at a glance:
  • Time steps – unfolding the network through time
  • Vanishing gradients – the problem LSTM solves
  • 3 gates – forget, input, output
  • Bidirectional – context from both past & future

Why Recurrent Networks?

Feedforward networks assume independent inputs. For sequences (time series, text, audio), we need memory. RNNs maintain a hidden state that carries information across time steps.

hₜ = tanh(W·[hₜ₋₁, xₜ] + b)
yₜ = W_y·hₜ + b_y

Parameters are shared across time steps. The same W, b used at every step.
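The recurrence above can be sketched directly in PyTorch; the sizes here (10 input features, 20 hidden units, 5 outputs, 7 time steps) are hypothetical, chosen only to show the shared W and b reused at every step:

```python
import torch

# Minimal sketch of the RNN recurrence with explicitly shared parameters.
W = torch.randn(20, 30) * 0.1   # acts on [h_{t-1}, x_t]  (20 + 10 dims)
b = torch.zeros(20)
W_y = torch.randn(5, 20) * 0.1  # output projection
b_y = torch.zeros(5)

x = torch.randn(7, 10)          # a sequence of 7 steps, 10 features each
h = torch.zeros(20)
outputs = []
for t in range(7):              # the SAME W, b at every time step
    h = torch.tanh(W @ torch.cat([h, x[t]]) + b)
    outputs.append(W_y @ h + b_y)
```

Because the parameters are shared, the model size does not grow with sequence length.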

Vanilla RNN & Backpropagation Through Time

RNN Cell

hₜ = tanh(W_ih·xₜ + b_ih + W_hh·hₜ₋₁ + b_hh)

Hidden state combines current input and previous hidden state.

# PyTorch RNN cell (single step at a time)
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=10, hidden_size=20)

x = torch.randn(5, 1, 10)  # (seq_len, batch, input_size)
h = torch.zeros(1, 20)     # initial hidden state
for t in range(x.size(0)):
    h = rnn_cell(x[t], h)  # same weights reused at every step
Backprop Through Time

Gradients flow backward through time steps. Chain rule multiplies across many tanh derivatives → vanishing/exploding gradients.

Problem: vanilla RNNs struggle to carry information across long sequences (beyond roughly 10–20 steps).

# BPTT conceptually (pseudocode)
for t in reversed(range(seq_len - 1)):
    # the gradient at time t depends on the gradient at t+1,
    # scaled by W_hh and the tanh derivative (1 - h²)
    grad_h[t] += W_hh.T @ (grad_h[t + 1] * (1 - h[t + 1] ** 2))
Truncated BPTT: Limit the number of time steps backpropagated (e.g., 20-50 steps). Common in training language models.
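Truncated BPTT can be sketched with `detach()`: carry the hidden state across chunks, but cut the autograd graph at each chunk boundary. The model, sizes, and truncation length below are hypothetical:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 100, 10)   # (batch, long sequence, features)
y = torch.randn(4, 100, 1)
h = torch.zeros(1, 4, 20)
k = 25                        # truncation length: backprop at most k steps

for start in range(0, 100, k):
    h = h.detach()            # cut the graph at the chunk boundary
    out, h = rnn(x[:, start:start + k], h)
    loss = nn.functional.mse_loss(head(out), y[:, start:start + k])
    opt.zero_grad()
    loss.backward()           # gradients flow only within this chunk
    opt.step()
```

The state still flows forward across the whole sequence; only the gradient is truncated.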

The Vanishing Gradient Problem

Why gradients vanish

During BPTT, ∂hₜ/∂hₖ = ∏ (W_hhᵀ · diag(tanh′)), with tanh′ ≤ 1. Repeated multiplication drives the gradient toward 0, erasing long-term dependencies.

Effect: RNN cannot learn relationships between distant tokens.
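The decay can be observed directly. A small sketch (hypothetical 20-unit state, small random W_hh) measures the gradient that reaches the initial state as the sequence grows:

```python
import torch

torch.manual_seed(0)
W_hh = torch.randn(20, 20) * 0.1  # small recurrent weights

def grad_norm_after(T):
    """Norm of the gradient flowing back through T tanh steps."""
    h0 = torch.zeros(20, requires_grad=True)
    h = h0
    for _ in range(T):
        h = torch.tanh(W_hh @ h + 0.1)
    h.sum().backward()                # product of T Jacobians
    return h0.grad.norm().item()

print(grad_norm_after(5), grad_norm_after(50))  # norm shrinks sharply with depth
```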

Solutions
  • LSTM/GRU – gating preserves gradients
  • ReLU + proper init (helps but not robust)
  • Gradient clipping (for explosion)
  • Residual connections

LSTM – The Gated Solution

LSTM Gates

fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f) // forget gate
iₜ = σ(W_i·[hₜ₋₁, xₜ] + b_i) // input gate
oₜ = σ(W_o·[hₜ₋₁, xₜ] + b_o) // output gate
c̃ₜ = tanh(W_c·[hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ // cell state
hₜ = oₜ ⊙ tanh(cₜ)

Cell state acts as a gradient highway. Forget gate controls what to keep/erase. Gradients flow through addition, not multiplication.

LSTM in 30 seconds
  • Forget – reset cell state
  • Input – write new info
  • Output – expose cell state
  • Cell – long-term memory
  • Hidden – short-term / output
PyTorch LSTM – from nn.LSTM to custom cell
import torch
import torch.nn as nn

# Built-in LSTM
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 100)   # (batch, seq, feature)
output, (h_n, c_n) = lstm(x)  # output: (8, 50, 512) – hidden×2 directions

# Manual LSTM cell (for understanding)
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fc = nn.Linear(input_size + hidden_size, hidden_size * 4)
        
    def forward(self, x, h, c):
        gates = self.fc(torch.cat([x, h], dim=1))
        f, i, o, c_tilde = gates.chunk(4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c = f * c + i * torch.tanh(c_tilde)
        h = o * torch.tanh(c)
        return h, c
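Driving a cell like this by hand means unrolling it over time yourself. A usage sketch with the built-in `nn.LSTMCell` (same step-wise interface; sizes hypothetical):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=10, hidden_size=20)
x = torch.randn(7, 3, 10)         # (seq_len, batch, features)
h = torch.zeros(3, 20)            # short-term state / output
c = torch.zeros(3, 20)            # long-term cell memory
for t in range(7):
    h, c = cell(x[t], (h, c))     # one gated step per time step
```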

GRU – LSTM's Leaner Cousin

GRU Gates (only two)

zₜ = σ(W_z·[hₜ₋₁, xₜ]) // update gate
rₜ = σ(W_r·[hₜ₋₁, xₜ]) // reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])
hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ

Combines forget and input gates. Fewer parameters, often similar performance.

LSTM vs GRU
Gates & state   LSTM: 3 gates, cell state + hidden state · GRU: 2 gates, hidden state only
Parameters      per unit, LSTM ≈ 4× a vanilla RNN, GRU ≈ 3×
When GRU?       Smaller dataset, faster training
# PyTorch GRU
gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
output, h_n = gru(x)

# TensorFlow/Keras
tf.keras.layers.GRU(units=128, return_sequences=True)

Stacked & Bidirectional RNNs

Stacked (Deep) RNNs

The hidden-state sequence of each layer becomes the input to the layer above. Captures hierarchical features.

lstm = nn.LSTM(input_size, hidden_size, num_layers=3, dropout=0.3)

Dropout between layers (except last).

Bidirectional RNNs

Two independent RNNs: left-to-right and right-to-left. Concatenate outputs. Context from both sides.

lstm = nn.LSTM(input_size, hidden_size, bidirectional=True)
# output shape: (batch, seq, hidden*2)

NLP essential: BERT-style models likewise rely on bidirectional context.
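The two directions are concatenated along the feature axis, so the output can be split back apart. A sketch with hypothetical sizes:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20,
               batch_first=True, bidirectional=True)
x = torch.randn(3, 7, 10)                  # (batch, seq, features)
output, (h_n, c_n) = lstm(x)               # output: (3, 7, 40) = hidden×2
fwd, bwd = output[..., :20], output[..., 20:]  # left-to-right / right-to-left
```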

Encoder-Decoder & Attention

Sequence-to-Sequence

Encoder compresses input sequence to context vector (final hidden). Decoder generates output from context.

Problem: Fixed context bottleneck for long sequences.

Attention Mechanism

Decoder looks at all encoder hidden states. Context = weighted sum of encoder outputs.

eᵢⱼ = score(h_decᵢ, h_encⱼ)
αᵢⱼ = softmax(eᵢⱼ)
cᵢ = ∑ αᵢⱼ h_encⱼ

Attention scores: dot product, additive (Bahdanau), or multiplicative (Luong).

Attention is all you need: Transformers replace RNNs with self-attention. But RNN+attention still used in speech, streaming models.
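The three attention equations above map directly to code. A dot-product-score sketch for a single decoder step (sizes hypothetical):

```python
import torch
import torch.nn.functional as F

h_enc = torch.randn(7, 20)   # 7 encoder hidden states, 20 dims each
h_dec = torch.randn(20)      # one decoder hidden state

e = h_enc @ h_dec            # scores e_j = dot(h_dec, h_enc_j)
alpha = F.softmax(e, dim=0)  # attention weights α_j, sum to 1
c = alpha @ h_enc            # context = weighted sum of encoder states
```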

RNN/LSTM in PyTorch & TensorFlow

PyTorch – Sentiment LSTM
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, 
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        embedded = self.embedding(x)
        output, (h_n, c_n) = self.lstm(embedded)
        # Concatenate final forward and backward hidden
        h_n = torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim=1)
        h_n = self.dropout(h_n)
        return self.fc(h_n)
TensorFlow/Keras LSTM
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Real-World Applications

NLP

Language modeling, NER, translation

Speech

ASR, synthesis, keyword spotting

Time Series

Stock, weather, anomaly detection

Bioinformatics

Protein sequence, gene expression

Model Comparison Table

Model          Gates  State  Long-range  Parameters  When to use
Vanilla RNN    0      h      Poor        Low         Short sequences, debugging
LSTM           3      h, c   ✅✅         High        Default for complex sequences
GRU            2      h      Good        Medium      Small data, faster training
Bidirectional  –      –      –           –           NLP, complete context available
Stacked        –      –      –           Depth×      Hierarchical features

Training RNNs/LSTMs – Best Practices

✅ Gradient clipping: Essential. Clip norm to 1.0 or 5.0.
✅ Initialize forget gate bias to 1 – helps remember at start.
✅ Layer normalization – stabilizes LSTM training.
⚠️ Don't use RNN for very long sequences (>500) – use Transformer or CNN.
⚠️ Watch for overfitting – LSTMs have many parameters. Use dropout (nn.LSTM's dropout argument applies between stacked layers).

Pro tip: For time series, try batch_first=True and pack_padded_sequence for variable-length sequences.
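The packing workflow from the tip above, sketched end to end (sizes and lengths hypothetical): pad to a common length, pack so the LSTM skips the padding, then unpack.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(3, 7, 10)          # padded batch, max length 7
lengths = torch.tensor([7, 5, 2])  # true lengths, sorted descending
                                   # (or pass enforce_sorted=False)
packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, (h_n, c_n) = lstm(packed)   # padding steps are never computed
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
```

`h_n` then holds the state at each sequence's true last step, not at the padded end.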

RNN/LSTM Cheatsheet

  • RNN – hₜ = tanh(W·[hₜ₋₁, xₜ] + b)
  • LSTM – 3 gates + cell state
  • GRU – update + reset gates
  • BPTT – unroll, then backprop
  • Bidirectional – past + future context
  • Packing – variable-length sequences
  • Attention – weighted sum of encoder states
  • Seq2Seq – encoder–decoder
Next Up: Transformers & Attention – Beyond RNNs.