
Transformers: Attention Is All You Need

The Transformer architecture revolutionized deep learning by replacing recurrence with self-attention. It powers every modern LLM, from BERT and GPT to Llama and Gemini. This guide covers the mathematics, the implementation, and the major variants.

Self-Attention: Q, K, V matrices
Multi-Head: parallel attention
Positional Encoding: sequence order
Encoder-Decoder: flexible architecture

What is a Transformer?

A Transformer is a deep learning architecture that relies entirely on self-attention to model relationships in sequences. Introduced in 2017 by Vaswani et al., it abandoned recurrence (RNNs) and convolution (CNNs) in favor of parallelizable attention mechanisms. It's the foundation of BERT, GPT, T5, Vision Transformers, and virtually all large language models.

Input Sequence → Embedding + Positional Encoding
[Encoder Block × N]
┌─────────────────┐
│   Multi-Head    │
│ Self-Attention  │
└────────┬────────┘
         ↓ Add & Norm
┌─────────────────┐
│  Feed-Forward   │
└────────┬────────┘
         ↓ Add & Norm
→ Output Probabilities

Transformers process all tokens in parallel. Attention maps global dependencies.

Scaled Dot-Product Attention

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q: queries, K: keys, V: values.
√dₖ: scaling factor that keeps dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.

Each token attends to every token; the output is a weighted sum of the values.

Self-Attention vs Cross-Attention

Self-attention: Q, K, V all come from the same sequence (encoder and decoder self-attention layers).

Cross-attention: Q comes from the decoder; K and V come from the encoder (encoder-decoder attention).

Masked Self-Attention

Prevents attending to future tokens. Used in autoregressive decoders (GPT). Set attention scores to -∞ before softmax.
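A causal mask is simply a lower-triangular matrix of ones; a minimal NumPy sketch (the sequence length is illustrative):

```python
import numpy as np

seq_len = 4  # illustrative
# Lower-triangular matrix: position i may attend to positions <= i
causal_mask = np.tril(np.ones((seq_len, seq_len)))
# Scores at positions where causal_mask == 0 are set to a large negative
# value before softmax, driving their attention weights to ~0.
```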

Scaled Dot-Product Attention from scratch (NumPy)
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (..., seq_len, d_k)
    mask: (..., seq_len, seq_len) optional; 0 marks disallowed positions
    """
    d_k = Q.shape[-1]
    # swapaxes works for any number of leading batch dims
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)  # (..., seq_len, seq_len)

    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # e.g. causal mask

    # Numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = np.matmul(attention_weights, V)
    return output, attention_weights

Multi-Head Attention

Instead of single attention, project Q, K, V h times with different learned linear projections, perform attention in parallel, concatenate, and project again.

Why multiple heads?

Each head learns different attention patterns: local, global, syntactic, semantic. Standard: h=8, 12, 16, 32 for large models.

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) Wᴼ

headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

Intuition: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
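The project-split-attend-concatenate steps can be sketched in a few lines of PyTorch; a minimal sketch (the fused QKV projection is an implementation choice, not part of the original formulation):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (no dropout, no masking)."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.out_proj = nn.Linear(d_model, d_model)      # final projection Wᴼ

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)

        def split(t):  # -> (batch, num_heads, seq_len, d_k)
            return t.reshape(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5  # (batch, heads, seq, seq)
        out = scores.softmax(dim=-1) @ v                    # weighted sum of values
        # Concatenate heads back to (batch, seq_len, d_model) and project
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)
```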

Positional Encoding: Injecting Order

Sinusoidal Encodings

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

Fixed, no learning. Enables extrapolation.

Learned Positional Embeddings

Trainable vector per position (BERT, GPT). Simpler, but limited to max length.
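A learned positional embedding is just an embedding table indexed by position; a minimal sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

max_len, d_model = 512, 64                       # illustrative sizes
pos_embedding = nn.Embedding(max_len, d_model)   # one trainable vector per position

token_embeddings = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
positions = torch.arange(10)                     # position ids 0..9
x = token_embeddings + pos_embedding(positions)  # broadcasts over the batch dim
# Positions >= max_len raise an IndexError: no extrapolation past training length
```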

Modern variants: RoPE (Rotary), ALiBi (attention bias).

Sinusoidal Positional Encoding (PyTorch)
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # (1, seq_len, d_model)
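The rotary alternative (RoPE) can be sketched in a similar spirit; this is a hedged sketch of the "rotate-half" formulation used by several open implementations, with the function name and shapes chosen for illustration:

```python
import torch

def rotary_embedding(x, base=10000.0):
    """Apply RoPE to x of shape (seq_len, d_model); d_model must be even.
    Sketch of the 'rotate-half' variant; not a library API."""
    seq_len, d_model = x.shape
    half = d_model // 2
    inv_freq = base ** (-torch.arange(half).float() / half)              # (half,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position- and frequency-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because it is a pure rotation, RoPE preserves vector norms, and position 0 (angle 0) is left unchanged.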

Encoder-Decoder & Variants

Encoder-Only

BERT, RoBERTa, DeBERTa. Bidirectional context. Best for understanding tasks: classification, NER, extraction.

Decoder-Only

GPT, Llama, Mistral, Gemini. Autoregressive. Best for generation. Causal masking.

Encoder-Decoder

T5, BART. Sequence-to-sequence. Best for translation, summarization.

Encoder: Self-attention + FFN. Decoder: Masked self-attention + cross-attention + FFN.

Iconic Transformer Models (2017–2025)

BERT (2018)

Bidirectional Encoder. Masked LM + Next Sentence Prediction. 110M–340M params.

GPT-3 (2020)

Autoregressive decoder. 175B params. In-context learning.

T5 (2019)

Text-to-Text Transfer Transformer. Unified framework.

Vision Transformer, ViT (2020)

Split image into patches, treat as sequence. No convolutions.

Llama (2023)

Open-source, efficient. RMSNorm, SwiGLU, RoPE.

Mixture of Experts

Switch Transformer, Mistral. Sparse activation.

Transformer Block in PyTorch

Complete Transformer Encoder Layer
import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)  # (batch, seq, d_model)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)   # inside the feed-forward block
        self.dropout1 = nn.Dropout(dropout)  # after attention
        self.dropout2 = nn.Dropout(dropout)  # after feed-forward
        self.activation = F.relu

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention block with residual + norm (post-norm, as in the original paper)
        x = src
        attn_out, _ = self.self_attn(x, x, x, attn_mask=src_mask,
                                     key_padding_mask=src_key_padding_mask)
        x = self.norm1(x + self.dropout1(attn_out))

        # Feed-forward block with residual + norm
        ff_out = self.linear2(self.dropout(self.activation(self.linear1(x))))
        x = self.norm2(x + self.dropout2(ff_out))
        return x

Training Large Language Models

📚 Pretraining objectives:
  • MLM: BERT-style, mask 15% tokens
  • Autoregressive (CLM): GPT-style, predict next token
  • Span corruption: T5-style
⚡ Fine-tuning strategies:
  • Full fine-tuning
  • LoRA: Low-rank adapters
  • Prefix tuning, Adapters
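The autoregressive (CLM) objective amounts to shifting the sequence by one token and applying cross-entropy; a minimal sketch with random stand-in logits (sizes are illustrative):

```python
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 100, 2, 8            # illustrative sizes
token_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for model output

# Predict token t+1 from positions <= t: drop the last logit, drop the first target
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_targets = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_targets)
```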
LoRA-style parameter-efficient fine-tuning (conceptual)
# LoRA: W_effective = W_original + A @ B; only A and B are trainable
class LoRALayer(nn.Linear):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features)
        # Freeze the pretrained weights first, then register the adapters,
        # so lora_A and lora_B remain trainable
        self.weight.requires_grad_(False)
        if self.bias is not None:
            self.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))  # zero init: no change at start

    def forward(self, x):
        return super().forward(x) + x @ self.lora_A @ self.lora_B

Transformers Beyond Text

Vision

ViT, Swin, DINOv2. Image classification, detection, segmentation.

Audio

Whisper, AudioMAE. Speech recognition, generation.

Biology

AlphaFold2, ESM. Protein folding, sequences.

Reinforcement Learning

Decision Transformer, GATO.

Multimodal

CLIP, Flamingo, LLaVA, GPT-4V.

Time Series

Informer, Autoformer.

Transformer Variants & Use Cases – Cheatsheet

BERT: Encoder (understanding)
GPT: Decoder (generation)
T5: Encoder-Decoder (seq2seq)
ViT: Vision
Whisper: Speech
RoPE: Positional encoding
SwiGLU: Activation
LoRA: Efficient tuning

Transformer Model Comparison

Model        Year  Architecture  Params     Key Innovation
Transformer  2017  Enc-Dec       65M        Self-attention, no RNN/CNN
BERT         2018  Encoder       110M-340M  Masked LM, bidirectional
GPT-3        2020  Decoder       175B       Few-shot, in-context learning
T5           2019  Enc-Dec       11B        Text-to-text unified
ViT          2020  Encoder       86M-632M   Image patches as tokens
Llama 2      2023  Decoder       7B-70B     Open, efficient, Grouped-Query Attention

Transformer Pitfalls & Debugging

⚠️ Quadratic complexity: attention is O(n²) in sequence length. For long sequences, use sparse attention, Linformer, or memory-efficient kernels such as FlashAttention.
⚠️ Training instability: Warmup (5-10% steps), gradient clipping, Adam betas=(0.9, 0.98).
✅ Positional encoding: For very long sequences, use RoPE or ALiBi.
✅ Debug attention: Visualize attention maps and watch for degenerate patterns, e.g. every query collapsing onto a single token.
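One quick numeric check, for illustration: compare the row-wise entropy of the attention weights; a collapsed head has near-zero entropy (the function name here is ours, not a library API):

```python
import numpy as np

def attention_entropy(weights, eps=1e-9):
    """Mean entropy (in nats) of the rows of an attention-weight matrix.
    weights: (seq_len, seq_len), each row sums to 1."""
    return float(np.mean(-np.sum(weights * np.log(weights + eps), axis=-1)))

uniform = np.full((4, 4), 0.25)   # maximally diffuse attention, entropy = log(4)
collapsed = np.eye(4)             # every query attends to exactly one token
```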