Transformers: Attention Is All You Need
The Transformer architecture revolutionized deep learning by replacing recurrence with self-attention. It underpins virtually every modern LLM, from BERT and GPT to Llama and Gemini. This guide covers the mathematics, implementation, and major variants.
- Self-Attention: Q, K, V matrices
- Multi-Head: parallel attention
- Positional Encoding: sequence order
- Encoder-Decoder: flexible architecture
What is a Transformer?
A Transformer is a deep learning architecture that relies entirely on self-attention to model relationships in sequences. Introduced in 2017 by Vaswani et al., it abandoned recurrence (RNNs) and convolution (CNNs) in favor of parallelizable attention mechanisms. It's the foundation of BERT, GPT, T5, Vision Transformers, and virtually all large language models.
[Encoder Block × N]
┌─────────────────┐
│   Multi-Head    │
│ Self-Attention  │
└────────┬────────┘
         ↓  Add & Norm
┌─────────────────┐
│  Feed-Forward   │
└────────┬────────┘
         ↓  Add & Norm
→ Output Probabilities
Transformers process all tokens in parallel. Attention maps global dependencies.
Scaled Dot-Product Attention
Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Q: Queries, K: Keys, V: Values.
√dₖ: scaling factor to prevent dot products from growing large.
Each token attends to every token in the sequence; the output is an attention-weighted sum of the value vectors.
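To see why the √dₖ factor matters, here is a quick numeric check (a standalone sketch, not from the original text): the dot product of two random dₖ-dimensional vectors with unit-variance entries has standard deviation √dₖ, so dividing by √dₖ keeps the softmax inputs in a stable range regardless of dimension.

```python
import numpy as np

# Standard deviation of q·k grows like sqrt(d_k); scaling restores it to ~1.
rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=-1)
    print(d_k, round(dots.std(), 1), round((dots / np.sqrt(d_k)).std(), 2))
```

Without scaling, large dₖ pushes the softmax into regions with near-zero gradients.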
Self-Attention vs Cross-Attention
Self-attention: Q, K, V come from the same sequence (encoder and decoder self-attention layers).
Cross-attention: Q comes from the decoder; K, V come from the encoder (the encoder-decoder attention layer).
Masked Self-Attention
Prevents attending to future tokens. Used in autoregressive decoders (GPT). Set attention scores to -∞ before softmax.
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (..., seq_len, d_k)
    mask: (..., seq_len, seq_len), optional; 0 marks blocked positions
    """
    d_k = Q.shape[-1]
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # e.g. causal mask
    # numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = np.matmul(attention_weights, V)
    return output, attention_weights
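As a quick illustration of causal masking, here is a standalone toy example (sizes are made up for the demo): the lower-triangular mask zeroes out all attention to future positions.

```python
import numpy as np

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)
mask = np.tril(np.ones((seq_len, seq_len)))   # 1 where attending is allowed
scores = np.where(mask == 0, -1e9, scores)    # block future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle is 0: no attention to the future
```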
Multi-Head Attention
Instead of single attention, project Q, K, V h times with different learned linear projections, perform attention in parallel, concatenate, and project again.
Why multiple heads?
Each head learns different attention patterns: local, global, syntactic, semantic. Standard: h=8, 12, 16, 32 for large models.
MultiHead(Q, K, V) = Concat(head₁, …, headₕ) Wᴼ
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
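The formula above can be sketched in NumPy (a minimal, self-contained illustration; the weight matrices are random placeholders, not a trained model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); each W: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads
    # Project, then split the feature dim into heads: (num_heads, seq_len, d_k)
    def project(W):
        return (x @ W).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    heads = softmax(scores) @ V                        # (heads, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, h = 16, 4
x = rng.normal(size=(5, d_model))
Ws = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(x, *Ws, num_heads=h)
print(out.shape)  # (5, 16)
```

Note that all heads run in one batched matmul; the split/concat is just a reshape.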
Positional Encoding: Injecting Order
Sinusoidal Encodings
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Fixed, no learning. Enables extrapolation.
Learned Positional Embeddings
Trainable vector per position (BERT, GPT). Simpler, but limited to max length.
Modern variants: RoPE (Rotary), ALiBi (attention bias).
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # (1, seq_len, d_model)
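For comparison, RoPE (mentioned above) rotates each pair of feature dimensions by a position-dependent angle instead of adding a vector, which makes attention scores depend only on relative offsets. A minimal NumPy sketch (illustrative only; real implementations apply this to Q and K inside attention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (seq_len, d), d even."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)     # same frequencies as sinusoidal PE
    angles = pos * freqs                          # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2-D rotation per dim pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)
rq = rope(np.tile(q, (4, 1)))   # same query vector at positions 0..3
rk = rope(np.tile(k, (4, 1)))
# Key property: the rotated dot product depends only on the relative offset.
print(np.isclose(rq[0] @ rk[1], rq[1] @ rk[2]))  # True
```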
Encoder-Decoder & Variants
Encoder-Only
BERT, RoBERTa, DeBERTa. Bidirectional context. Best for understanding tasks: classification, NER, extraction.
Decoder-Only
GPT, Llama, Mistral, Gemini. Autoregressive. Best for generation. Causal masking.
Encoder-Decoder
T5, BART. Sequence-to-sequence. Best for translation, summarization.
Encoder: Self-attention + FFN. Decoder: Masked self-attention + cross-attention + FFN.
Iconic Transformer Models (2017–2025)
BERT (2018)
Bidirectional Encoder. Masked LM + Next Sentence Prediction. 110M–340M params.
GPT-3 (2020)
Autoregressive decoder. 175B params. In-context learning.
T5 (2019)
Text-to-Text Transfer Transformer. Unified framework.
Vision Transformer (ViT, 2020)
Split image into patches, treat as sequence. No convolutions.
Llama (2023)
Open-source, efficient. RMSNorm, SwiGLU, RoPE.
Mixture of Experts
Switch Transformer, Mistral. Sparse activation.
Transformer Block in PyTorch
import torch.nn as nn
import torch.nn.functional as F
class TransformerEncoderLayer(nn.Module):
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
super().__init__()
self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
self.linear1 = nn.Linear(d_model, dim_feedforward)
self.linear2 = nn.Linear(dim_feedforward, d_model)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
self.activation = F.relu
def forward(self, src, src_mask=None, src_key_padding_mask=None):
# Self-attention block with residual + norm
x = src
attn_out, _ = self.self_attn(x, x, x, attn_mask=src_mask,
key_padding_mask=src_key_padding_mask)
x = x + self.dropout1(attn_out)
x = self.norm1(x)
# Feedforward block with residual + norm
ff_out = self.linear2(self.dropout2(self.activation(self.linear1(x))))
x = x + self.dropout2(ff_out)
x = self.norm2(x)
return x
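PyTorch ships an equivalent built-in layer; a minimal usage sketch for stacking it into a full encoder (the hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn

# Stack PyTorch's built-in encoder layer; batch_first=True uses (batch, seq, d_model).
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=256,
                                   dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(8, 10, 64)   # (batch, seq_len, d_model)
out = encoder(x)
print(out.shape)             # torch.Size([8, 10, 64])
```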
Training Large Language Models
Pre-training objectives:
- MLM: BERT-style, mask 15% of tokens
- Autoregressive (CLM): GPT-style, predict the next token
- Span corruption: T5-style

Fine-tuning methods:
- Full fine-tuning
- LoRA: low-rank adapters
- Prefix tuning, Adapters
# LoRA: y = x·W + x·A·B, with only A and B trainable (W frozen)
class LoRALayer(nn.Linear):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features)
        self.weight.requires_grad_(False)  # freeze only the original weights
        if self.bias is not None:
            self.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))  # zero init: no change at start

    def forward(self, x):
        return super().forward(x) + x @ self.lora_A @ self.lora_B
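A quick sanity check on the parameter savings (this re-defines the layer under a different name so the snippet runs standalone; the 768-dim size is illustrative, roughly BERT-base):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Linear):
    """LoRA-augmented linear layer: frozen base weights plus trainable A, B."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features)
        self.weight.requires_grad_(False)          # freeze only the base weights
        if self.bias is not None:
            self.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))  # zero init

    def forward(self, x):
        return super().forward(x) + x @ self.lora_A @ self.lora_B

layer = LoRALinear(768, 768, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 12288 of 602880 parameters (~2%) are trainable
```

Because lora_B starts at zero, the layer initially computes exactly the frozen base projection.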
Transformers Beyond Text
Vision
ViT, Swin, DINOv2. Image classification, detection, segmentation.
Audio
Whisper, AudioMAE. Speech recognition, generation.
Biology
AlphaFold2, ESM. Protein folding, sequences.
Reinforcement Learning
Decision Transformer, GATO.
Multimodal
CLIP, Flamingo, LLaVA, GPT-4V.
Time Series
Informer, Autoformer.
Transformer Variants & Use Cases – Cheatsheet
Transformer Model Comparison
| Model | Year | Architecture | Params | Key Innovation |
|---|---|---|---|---|
| Transformer | 2017 | Enc-Dec | 65M | Self-attention, no RNN/CNN |
| BERT | 2018 | Encoder | 110M-340M | Masked LM, bidirectional |
| T5 | 2019 | Enc-Dec | 11B | Text-to-text unified |
| GPT-3 | 2020 | Decoder | 175B | Few-shot, in-context learning |
| ViT | 2020 | Encoder | 86M-632M | Image patches as tokens |
| Llama 2 | 2023 | Decoder | 7B-70B | Open, efficient, Grouped-Query Attention |