Transfer Learning: Stand on the Shoulders of Giants
Why train from scratch when you can leverage models trained on massive datasets? Transfer learning adapts pretrained knowledge to your specific task, yielding faster training, lower data requirements, and often state-of-the-art performance.
- ImageNet: 14M images, 1,000 classes
- BERT: pretrained on 3.3B words
- ~10x training speedup is typical
- SOTA results reported on 95%+ of tasks
What is Transfer Learning?
Transfer learning = knowledge transfer from a source task (large dataset) to a target task (your specific problem). Instead of random initialization, start with weights learned on ImageNet, BERT, or GPT.
Low-level features (edges, textures) transfer broadly across domains; high-level, task-specific features need adaptation.
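To make the starting point concrete, a minimal PyTorch sketch (assuming torchvision >= 0.13) contrasting random initialization with ImageNet-pretrained initialization:

import torchvision.models as models

# Random initialization: training starts from scratch
scratch_model = models.resnet50(weights=None)
# Transfer learning: start from weights learned on ImageNet
pretrained_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)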
Feature Extraction vs Fine-Tuning
Feature Extraction
Freeze the pretrained backbone (no gradient updates); train only the new classifier head.
- Treat model as fixed feature extractor
- Fast, low memory
- Works when target is similar to source
- When: Small dataset, limited compute
Fine-Tuning
Unfreeze some/all layers and continue training on target data.
- Adapts features to target domain
- Higher accuracy if enough data
- Risk of overfitting with tiny datasets
- When: Larger dataset, domain shift
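The two regimes differ only in which parameters receive gradients. A minimal PyTorch sketch (the helper name set_transfer_mode is ours, not a library function):

def set_transfer_mode(model, head, fine_tune=False):
    """Feature extraction when fine_tune=False; full fine-tuning when True."""
    for p in model.parameters():
        p.requires_grad = fine_tune  # backbone frozen unless fine-tuning
    for p in head.parameters():
        p.requires_grad = True       # the new head always trains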
Computer Vision – PyTorch & TensorFlow
TorchVision and Keras Applications provide dozens of ImageNet-pretrained models.
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet50 (torchvision >= 0.13; use pretrained=True on older versions)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head (new task: 10 classes)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 10)  # new head has requires_grad=True by default

# Train only the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

# For fine-tuning: unfreeze the last residual block later
for param in model.layer4.parameters():
    param.requires_grad = True
import tensorflow as tf

# Load pretrained backbone
base_model = tf.keras.applications.EfficientNetB0(
    include_top=False,           # drop the ImageNet classifier
    weights='imagenet',
    input_shape=(224, 224, 3),
    pooling='avg',               # global average pooling
)

# Freeze the base model
base_model.trainable = False

# Add a new classifier head
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Fine-tuning later: unfreeze, but keep the first 100 layers frozen
base_model.trainable = True
for layer in base_model.layers[:100]:
    layer.trainable = False
# Remember to recompile after changing trainable flags
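Usage sketch: compile and fit as usual (train_ds and val_ds are assumed tf.data datasets; values are illustrative). After changing trainable flags, Keras requires recompiling for the change to take effect:

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=5)

# After base_model.trainable = True: recompile with a much lower LR
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])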
NLP Transfer Learning – BERT & Friends
The Hugging Face transformers library is the industry standard for NLP transfer learning.
BERT Fine-Tuning
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=3,  # e.g., three sentiment classes
)

# Freeze the embeddings (optional)
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Fine-tune all encoder layers with discriminative learning rates:
# lower LR for early layers, higher for later layers and the new head
optimizer = torch.optim.AdamW([
    {'params': model.bert.encoder.layer[:6].parameters(), 'lr': 1e-5},
    {'params': model.bert.encoder.layer[6:].parameters(), 'lr': 2e-5},
    {'params': model.classifier.parameters(), 'lr': 3e-5},
])
GPT for Text Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Fine-tune on a domain-specific corpus with the standard causal LM loss,
# or skip fine-tuning entirely and use prompt-based / few-shot learning
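A quick generation sketch with the loaded GPT-2 (the prompt and sampling parameters are illustrative):

inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50,
                         do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))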
For very large language models, parameter-efficient adaptation (LoRA, prefix tuning, adapters) replaces full fine-tuning; see the section below.
Tip: AutoModelForSequenceClassification and the Trainer API make fine-tuning largely boilerplate-free.
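A minimal Trainer sketch (train_ds and val_ds are assumed pre-tokenized datasets; hyperparameters are illustrative):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
args = TrainingArguments(output_dir='out', num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()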
Freezing Strategies & Gradual Unfreezing
Not all layers transfer equally. Early layers learn general features; later layers are task-specific.
(Layer states: frozen · fine-tune at low LR · train from scratch)
Option 1: Only train head
Dataset: 100-1000 images. Fast, minimal overfitting.
Option 2: Unfreeze top block
Dataset: 1k-10k images. Gradually unfreeze from top.
Option 3: Full fine-tuning
Dataset: 10k+ images. Use a ~10x lower LR than training from scratch.
# Phase 1: train only the classifier head (backbone frozen)
for epoch in range(5):
    train_one_epoch(model, optimizer)  # your standard training loop (not shown)

# Phase 2: unfreeze the last layer group and give it a lower LR
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-4},
])

# Phase 3: unfreeze everything and drop the global LR
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
Discriminative Learning Rates
Different layers need different learning rates. Lower LR for pretrained base, higher for new head.
Why?
Pretrained weights are already good, so they need only small updates; the new head is randomly initialized and needs larger steps.
Rule of thumb: LR_base = LR_head / 10 to LR_head / 100.
# PyTorch – per-parameter-group LRs
optimizer = torch.optim.AdamW([
    {'params': model.conv1.parameters(), 'lr': 1e-6},   # earliest layers: tiny steps
    {'params': model.layer4.parameters(), 'lr': 5e-6},  # later layers: slightly larger
    {'params': model.fc.parameters(), 'lr': 1e-4},      # new head: largest
])

# fast.ai – slice spreads LRs across layer groups
learn.fit_one_cycle(5, slice(1e-6, 1e-4))

# TensorFlow – Keras optimizers have no built-in per-layer LR;
# use multiple optimizers in a custom train step (see sketch below)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
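One workaround in TensorFlow is two optimizers inside a custom train step. A sketch, assuming base_model.trainable = True and that loss_fn, x, and y are defined elsewhere:

base_opt = tf.keras.optimizers.Adam(1e-5)   # pretrained backbone: small steps
head_opt = tf.keras.optimizers.Adam(1e-4)   # new head: larger steps
base_vars = base_model.trainable_variables
head_vars = [v for v in model.trainable_variables
             if all(v is not b for b in base_vars)]

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
grads = tape.gradient(loss, base_vars + head_vars)
base_opt.apply_gradients(zip(grads[:len(base_vars)], base_vars))
head_opt.apply_gradients(zip(grads[len(base_vars):], head_vars))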
Domain Shift & Catastrophic Forgetting
Catastrophic forgetting: aggressive fine-tuning overwrites pretrained knowledge. Solution: lower LR, freeze early layers, regularization (EWC, L2-SP), as sketched below.
Domain shift: the target distribution differs from the source. Solution: fine-tune more layers, use adversarial domain adaptation, or train from scratch if the domains are too different.
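For instance, an L2-SP-style penalty anchors weights to their pretrained values so fine-tuning cannot drift too far. A sketch (task_loss is assumed; snapshot the weights after replacing the head, before fine-tuning):

# Snapshot the pretrained weights once
anchor = {n: p.detach().clone() for n, p in model.named_parameters()}

def l2_sp_penalty(model, strength=1e-3):
    # Penalize squared distance from the pretrained starting point
    return strength * sum(((p - anchor[n]) ** 2).sum()
                          for n, p in model.named_parameters())

loss = task_loss + l2_sp_penalty(model)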
When NOT to use transfer learning?
- The target domain is extremely different (e.g., medical imaging vs. natural images); even then, pretrained low-level features often help.
- You have a massive dataset (millions of examples), so training from scratch is viable.
- Latency constraints require tiny models; even then, you can distill from a larger pretrained teacher.
Parameter-Efficient Transfer Learning
For huge models (LLMs), full fine-tuning is expensive. Adapters and LoRA inject small trainable modules.
LoRA
Low-rank adaptation: trainable low-rank matrices A·B run in parallel to the frozen weights; only these are trained.
Reported to match full GPT-3 fine-tuning quality while training under 1% of the parameters.
Adapter
Small bottleneck MLP inserted between transformer layers.
Prefix Tuning
Learn virtual tokens prepended to input. Keeps base model frozen.
# LoRA with Hugging Face PEFT
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

lora_config = LoraConfig(
    r=8,                        # rank of the A·B update matrices
    lora_alpha=32,              # scaling factor
    target_modules=["q", "v"],  # T5's query/value projections
    lora_dropout=0.1,
    task_type=TaskType.SEQ_2_SEQ_LM,
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
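Prefix tuning follows the same PEFT workflow; a sketch (values are illustrative):

from peft import PrefixTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
prefix_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,  # learned virtual tokens prepended to every input
)
model = get_peft_model(model, prefix_config)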
Transfer Learning Strategies Comparison
| Strategy | Frozen Layers | Data Required | Training Time | Accuracy | Use Case |
|---|---|---|---|---|---|
| Feature Extraction | All except head | ⭐ Very low | ⚡ Fast | 📊 Good | Small dataset, similar domain |
| Partial fine-tune (top block) | Early layers | ⭐⭐ Low | ⚡⚡ Medium | 📊📊 Better | Medium dataset, some domain shift |
| Full fine-tuning | None | ⭐⭐⭐ High | ⚡⚡⚡ Slow | 📊📊📊 Best | Large dataset, significant shift |
| Gradual unfreezing | Dynamic | ⭐⭐ Medium | ⚡⚡⚡ Slow | 📊📊📊 SOTA | NLP, ULMFiT style |
| Adapter/LoRA | All (frozen) | ⭐⭐ Low | ⚡ Fast | 📊📊 Near SOTA | LLMs, multi-task |
Real-World Transfer Learning
Medical Imaging
CheXNet: DenseNet-121 pretrained on ImageNet, fine-tuned on ChestX-ray14. Radiologist-level pneumonia detection with only 100k images.
Autonomous Driving
Models pretrained on Cityscapes and fine-tuned on specific camera setups can reportedly need ~90% less labeled data.
Sentiment Analysis
BERT fine-tuned on 10k labeled reviews matches LSTM trained on 1M+ reviews.
Robotics
Sim2Real: train in simulation (source), fine-tune on real robot (target).
Deploying Transfer Learning Models
Pretrained models are large. Optimize for production.
- Quantization: INT8 reduces model size ~4x and speeds up inference.
- Pruning: remove unimportant weights for 50%+ compression.
- Distillation: train a small student on the teacher's predictions.
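Distillation is sketched in the next block; for quantization, PyTorch dynamic quantization is a one-liner (a minimal sketch; model is assumed to be a trained module):

import torch
import torch.nn as nn

# Dynamic quantization: Linear weights stored as INT8,
# dequantized on the fly during inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)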
# Distillation example: train a small student on the teacher's soft labels
import torch
import torch.nn.functional as F
teacher = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
student = models.resnet18()
temperature = 4.0  # softens the teacher's output distribution
with torch.no_grad():                       # teacher stays frozen
    soft_labels = F.softmax(teacher(images) / temperature, dim=1)
student_logits = student(images)            # images: an input batch (assumed)
# KL divergence between softened distributions, scaled by T^2 (Hinton et al.)
student_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                        soft_labels, reduction='batchmean') * temperature ** 2