
Transformers: Attention Is All You Need

The Transformer architecture revolutionized deep learning by replacing recurrence with self-attention. It powers every modern LLM, from BERT and GPT to Llama and Gemini. This guide covers the mathematics, the implementation, and the major variants.

Self-Attention: Q, K, V matrices
Multi-Head: parallel attention
Positional Encoding: sequence order
Encoder-Decoder: flexible architecture

What is a Transformer?

A Transformer is a deep learning architecture that relies entirely on self-attention to model relationships in sequences. Introduced in 2017 by Vaswani et al., it abandoned recurrence (RNNs) and convolution (CNNs) in favor of parallelizable attention mechanisms. It's the foundation of BERT, GPT, T5, Vision Transformers, and virtually all large language models.

Input Sequence → Embedding + Positional Encoding
[Encoder Block × N]
┌─────────────────┐
│   Multi-Head    │
│ Self-Attention  │
└────────┬────────┘
         ↓ Add & Norm
┌─────────────────┐
│  Feed-Forward   │
└────────┬────────┘
         ↓ Add & Norm
→ Output Probabilities

Transformers process all tokens in parallel. Attention maps global dependencies.

Scaled Dot-Product Attention

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q: queries, K: keys, V: values.
√dₖ: scaling factor that keeps dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.

Each token attends to every token; the output is a weighted sum of the values.

Self-Attention vs Cross-Attention

Self-attention: Q, K, V all come from the same sequence (encoder and decoder self-attention layers).

Cross-attention: Q comes from the decoder; K and V come from the encoder (encoder-decoder attention).

Masked Self-Attention

Prevents attending to future tokens. Used in autoregressive decoders (GPT). Set attention scores to -∞ before softmax.
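A causal mask is simply a lower-triangular matrix of ones; a minimal NumPy sketch (the sequence length is illustrative):

```python
import numpy as np

seq_len = 4  # illustrative
# Lower-triangular matrix: position i may attend to positions <= i
causal_mask = np.tril(np.ones((seq_len, seq_len)))
# Scores at positions where causal_mask == 0 are set to a large negative
# value before softmax, driving their attention weights to ~0.
```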

Scaled Dot-Product Attention from scratch (NumPy)
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (..., seq_len, d_k)
    mask: (..., seq_len, seq_len) optional; 0 marks disallowed positions
    """
    d_k = Q.shape[-1]
    # swapaxes works for any number of leading batch dims
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)  # (..., seq_len, seq_len)

    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # e.g. causal mask

    # Numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = np.matmul(attention_weights, V)
    return output, attention_weights

Multi-Head Attention

Instead of single attention, project Q, K, V h times with different learned linear projections, perform attention in parallel, concatenate, and project again.

Why multiple heads?

Each head learns different attention patterns: local, global, syntactic, semantic. Standard: h=8, 12, 16, 32 for large models.

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) Wᴼ

headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

Intuition: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
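The project-split-attend-concatenate steps can be sketched in a few lines of PyTorch; a minimal sketch (the fused QKV projection is an implementation choice, not part of the original formulation):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (no dropout, no masking)."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.out_proj = nn.Linear(d_model, d_model)      # final projection Wᴼ

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)

        def split(t):  # -> (batch, num_heads, seq_len, d_k)
            return t.reshape(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5  # (batch, heads, seq, seq)
        out = scores.softmax(dim=-1) @ v                    # weighted sum of values
        # Concatenate heads back to (batch, seq_len, d_model) and project
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)
```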

Positional Encoding: Injecting Order

Sinusoidal Encodings

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

Fixed, no learning. Enables extrapolation.

Learned Positional Embeddings

Trainable vector per position (BERT, GPT). Simpler, but limited to max length.
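A learned positional embedding is just an embedding table indexed by position; a minimal sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

max_len, d_model = 512, 64                       # illustrative sizes
pos_embedding = nn.Embedding(max_len, d_model)   # one trainable vector per position

token_embeddings = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
positions = torch.arange(10)                     # position ids 0..9
x = token_embeddings + pos_embedding(positions)  # broadcasts over the batch dim
# Positions >= max_len raise an IndexError: no extrapolation past training length
```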

Modern variants: RoPE (Rotary), ALiBi (attention bias).

Sinusoidal Positional Encoding (PyTorch)
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # (1, seq_len, d_model)
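The rotary alternative (RoPE) can be sketched in a similar spirit; this is a hedged sketch of the "rotate-half" formulation used by several open implementations, with the function name and shapes chosen for illustration:

```python
import torch

def rotary_embedding(x, base=10000.0):
    """Apply RoPE to x of shape (seq_len, d_model); d_model must be even.
    Sketch of the 'rotate-half' variant; not a library API."""
    seq_len, d_model = x.shape
    half = d_model // 2
    inv_freq = base ** (-torch.arange(half).float() / half)              # (half,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position- and frequency-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because it is a pure rotation, RoPE preserves vector norms, and position 0 (angle 0) is left unchanged.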

Encoder-Decoder & Variants

Encoder-Only

BERT, RoBERTa, DeBERTa. Bidirectional context. Best for understanding tasks: classification, NER, extraction.

Decoder-Only

GPT, Llama, Mistral, Gemini. Autoregressive. Best for generation. Causal masking.

Encoder-Decoder

T5, BART. Sequence-to-sequence. Best for translation, summarization.

Encoder: Self-attention + FFN. Decoder: Masked self-attention + cross-attention + FFN.

Iconic Transformer Models (2017–2025)

BERT (2018)

Bidirectional Encoder. Masked LM + Next Sentence Prediction. 110M–340M params.

GPT-3 (2020)

Autoregressive decoder. 175B params. In-context learning.

T5 (2019)

Text-to-Text Transfer Transformer. Unified framework.

Vision Transformer, ViT (2020)

Split image into patches, treat as sequence. No convolutions.

Llama (2023)

Open-source, efficient. RMSNorm, SwiGLU, RoPE.

Mixture of Experts

Switch Transformer, Mistral. Sparse activation.

Transformer Block in PyTorch

Complete Transformer Encoder Layer
import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)  # (batch, seq, d_model)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)   # inside the feed-forward block
        self.dropout1 = nn.Dropout(dropout)  # after attention
        self.dropout2 = nn.Dropout(dropout)  # after feed-forward
        self.activation = F.relu

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention block with residual + norm (post-norm, as in the original paper)
        x = src
        attn_out, _ = self.self_attn(x, x, x, attn_mask=src_mask,
                                     key_padding_mask=src_key_padding_mask)
        x = self.norm1(x + self.dropout1(attn_out))

        # Feed-forward block with residual + norm
        ff_out = self.linear2(self.dropout(self.activation(self.linear1(x))))
        x = self.norm2(x + self.dropout2(ff_out))
        return x

Training Large Language Models

📚 Pretraining objectives:
  • MLM: BERT-style, mask 15% tokens
  • Autoregressive (CLM): GPT-style, predict next token
  • Span corruption: T5-style
⚡ Fine-tuning strategies:
  • Full fine-tuning
  • LoRA: Low-rank adapters
  • Prefix tuning, Adapters
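The autoregressive (CLM) objective amounts to shifting the sequence by one token and applying cross-entropy; a minimal sketch with random stand-in logits (sizes are illustrative):

```python
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 100, 2, 8            # illustrative sizes
token_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for model output

# Predict token t+1 from positions <= t: drop the last logit, drop the first target
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_targets = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_targets)
```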
LoRA-style parameter-efficient fine-tuning (conceptual)
# LoRA: W_effective = W_original + A @ B; only A and B are trainable
class LoRALayer(nn.Linear):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features)
        # Freeze the pretrained weights first, then register the adapters,
        # so lora_A and lora_B remain trainable
        self.weight.requires_grad_(False)
        if self.bias is not None:
            self.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))  # zero init: no change at start

    def forward(self, x):
        return super().forward(x) + x @ self.lora_A @ self.lora_B

Transformers Beyond Text

Vision

ViT, Swin, DINOv2. Image classification, detection, segmentation.

Audio

Whisper, AudioMAE. Speech recognition, generation.

Biology

AlphaFold2, ESM. Protein folding, sequences.

Reinforcement Learning

Decision Transformer, GATO.

Multimodal

CLIP, Flamingo, LLaVA, GPT-4V.

Time Series

Informer, Autoformer.

Transformer Variants & Use Cases – Cheatsheet

BERT: Encoder (understanding)
GPT: Decoder (generation)
T5: Encoder-Decoder (seq2seq)
ViT: Vision
Whisper: Speech
RoPE: Positional encoding
SwiGLU: Activation
LoRA: Efficient tuning

Transformer Model Comparison

Model        Year  Architecture  Params     Key Innovation
Transformer  2017  Enc-Dec       65M        Self-attention, no RNN/CNN
BERT         2018  Encoder       110M-340M  Masked LM, bidirectional
GPT-3        2020  Decoder       175B       Few-shot, in-context learning
T5           2019  Enc-Dec       11B        Text-to-text unified
ViT          2020  Encoder       86M-632M   Image patches as tokens
Llama 2      2023  Decoder       7B-70B     Open, efficient, Grouped-Query Attention

Transformer Pitfalls & Debugging

⚠️ Quadratic complexity: attention is O(n²) in sequence length. For long sequences, use sparse attention, Linformer, or memory-efficient kernels such as FlashAttention.
⚠️ Training instability: Warmup (5-10% steps), gradient clipping, Adam betas=(0.9, 0.98).
✅ Positional encoding: For very long sequences, use RoPE or ALiBi.
✅ Debug attention: Visualize attention maps and watch for degenerate patterns, e.g. every query collapsing onto a single token.
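One quick numeric check, for illustration: compare the row-wise entropy of the attention weights; a collapsed head has near-zero entropy (the function name here is ours, not a library API):

```python
import numpy as np

def attention_entropy(weights, eps=1e-9):
    """Mean entropy (in nats) of the rows of an attention-weight matrix.
    weights: (seq_len, seq_len), each row sums to 1."""
    return float(np.mean(-np.sum(weights * np.log(weights + eps), axis=-1)))

uniform = np.full((4, 4), 0.25)   # maximally diffuse attention, entropy = log(4)
collapsed = np.eye(4)             # every query attends to exactly one token
```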