Transfer Learning: Stand on the Shoulders of Giants
Why train from scratch when you can leverage models trained on massive datasets? Transfer learning enables faster training, less data, and state-of-the-art performance by adapting pretrained knowledge to your specific task.
- ImageNet: 14M images, 1,000 classes
- BERT: pretrained on 3.3B words
- ~10x training speedup vs. training from scratch
- State-of-the-art results on 95%+ of tasks
What is Transfer Learning?
Transfer learning = knowledge transfer from a source task (large dataset) to a target task (your specific problem). Instead of random initialization, start with weights learned on ImageNet, BERT, or GPT.
Low-level features (edges, textures) transfer universally. High-level features need adaptation.
Feature Extraction vs Fine-Tuning
Feature Extraction
Freeze pretrained backbone – no gradient updates. Only train new classifier head.
- Treat model as fixed feature extractor
- Fast, low memory
- Works when target is similar to source
- When: Small dataset, limited compute
Fine-Tuning
Unfreeze some/all layers and continue training on target data.
- Adapts features to target domain
- Higher accuracy if enough data
- Risk of overfitting with tiny datasets
- When: Larger dataset, domain shift
Computer Vision – PyTorch & TensorFlow
TorchVision and Keras Applications provide dozens of ImageNet-pretrained models.
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet50 (newer torchvision versions use the `weights=` argument instead)
model = models.resnet50(pretrained=True)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier (new task: 10 classes)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 10)

# Train only the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

# For fine-tuning: unfreeze the last residual block later
for param in model.layer4.parameters():
    param.requires_grad = True
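A minimal sketch of the training loop itself, assuming a DataLoader named train_loader that yields (images, labels) batches; only the parameters passed to the optimizer are updated.

criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)   # forward pass through backbone + new head
    loss.backward()
    optimizer.step()                          # updates only the parameters in the optimizer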
import tensorflow as tf

# Load pretrained backbone
base_model = tf.keras.applications.EfficientNetB0(
    include_top=False,          # drop classifier
    weights='imagenet',
    input_shape=(224, 224, 3),
    pooling='avg'               # global average pooling
)

# Freeze base model
base_model.trainable = False

# Add new classifier
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Fine-tuning later
base_model.trainable = True
# Freeze first 100 layers...
for layer in base_model.layers[:100]:
    layer.trainable = False
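A minimal two-phase training sketch, assuming a tf.data pipeline named train_ds with integer labels; note that Keras only picks up changes to trainable flags after the model is re-compiled.

# Phase 1: train the new head on the frozen backbone
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, epochs=5)

# Phase 2: unfreeze and fine-tune with a much lower learning rate
base_model.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),   # ~100x lower LR
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, epochs=5)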
NLP Transfer Learning – BERT & Friends
The Hugging Face transformers library is the industry standard for NLP transfer learning.
BERT Fine-Tuning
import torch
from transformers import BertForSequenceClassification, Trainer

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=3   # sentiment classes
)

# Freeze embeddings (optional)
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Fine-tune all encoder layers with discriminative learning rates
optimizer = torch.optim.AdamW([
    {'params': model.bert.encoder.layer[:6].parameters(), 'lr': 1e-5},
    {'params': model.bert.encoder.layer[6:].parameters(), 'lr': 2e-5},
    {'params': model.classifier.parameters(), 'lr': 3e-5}
])
GPT for Text Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Fine-tune on domain-specific corpus
# Use causal LM loss
# Prompt-based learning / few-shot
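A minimal sketch of one fine-tuning step on domain text, continuing the snippet above; the text string is a placeholder, and passing labels=input_ids makes the model compute the causal LM (next-token) loss itself.

import torch

tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token; needed when batching texts
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer(["placeholder domain-specific text"], return_tensors='pt')
outputs = model(**batch, labels=batch['input_ids'])   # causal LM loss computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()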
For adapting very large LLMs, parameter-efficient methods such as LoRA, prefix tuning, and adapters are covered in the section below.
For standard classification tasks, AutoModelForSequenceClassification and the Trainer API make fine-tuning largely boilerplate-free, as sketched below.
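A minimal Trainer sketch, assuming pre-tokenized train_ds and val_ds datasets (names are placeholders).

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

args = TrainingArguments(
    output_dir='bert-sentiment',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()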
Freezing Strategies & Gradual Unfreezing
Not all layers transfer equally. Early layers learn general features, later layers are task-specific.
Option 1: Only train head
Dataset: 100-1000 images. Fast, minimal overfitting.
Option 2: Unfreeze top block
Dataset: 1k-10k images. Gradually unfreeze from top.
Option 3: Full fine-tuning
Dataset: 10k+ images. Use 10x lower LR than scratch.
# Phase 1: train only the classifier head (schematic)
for epoch in range(5):
    train_one_epoch(model, head_optimizer)   # head_optimizer covers model.fc only

# Phase 2: unfreeze the last layer group
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-4}
])

# Phase 3: unfreeze all layers, lower the LR further
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
Discriminative Learning Rates
Different layers need different learning rates. Lower LR for pretrained base, higher for new head.
Why?
Pretrained weights already good → small updates. New head random → larger steps.
Rule of thumb: LR_base ≈ LR_head / 10 to LR_head / 100.
# PyTorch – per-parameter-group LRs (ResNet example)
optimizer = torch.optim.AdamW([
    {'params': model.conv1.parameters(), 'lr': 1e-6},    # stem: smallest updates
    {'params': model.layer4.parameters(), 'lr': 5e-6},   # last block: slightly larger
    {'params': model.fc.parameters(), 'lr': 1e-4}        # new head: largest LR
])

# fast.ai – slice() spreads LRs across layer groups
learn.fit_one_cycle(5, slice(1e-6, 1e-4))

# TensorFlow – a Keras optimizer has a single LR; per-layer LRs require
# manual gradient handling (e.g., one optimizer per group), as sketched below
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
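A minimal sketch of discriminative LRs in TensorFlow, assuming the base_model/model built in the Keras example above (with base_model.trainable = True), a single batch (x, y), and one optimizer per parameter group.

base_opt = tf.keras.optimizers.Adam(learning_rate=1e-5)   # pretrained backbone
head_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)   # new classifier layers
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

base_vars = base_model.trainable_variables
head_vars = [v for layer in model.layers[1:] for v in layer.trainable_variables]

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
grads = tape.gradient(loss, base_vars + head_vars)
base_opt.apply_gradients(zip(grads[:len(base_vars)], base_vars))
head_opt.apply_gradients(zip(grads[len(base_vars):], head_vars))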
Domain Shift & Catastrophic Forgetting
Catastrophic forgetting: fine-tuning overwrites the pretrained knowledge. Solution: lower LR, freeze early layers, regularization (EWC, L2-SP); see the sketch below.
Domain shift: the source and target data distributions differ. Solution: fine-tune more layers, use adversarial domain adaptation, or train from scratch if the domains are too different.
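A minimal sketch of an L2-SP-style penalty that pulls fine-tuned weights back toward their pretrained starting point; model and task_loss are assumed from the surrounding training code.

# Snapshot the pretrained weights before fine-tuning starts
pretrained_params = {name: p.clone().detach() for name, p in model.named_parameters()}

def l2_sp_penalty(model, strength=1e-3):
    # Penalize squared distance from the pretrained starting point
    return strength * sum(((p - pretrained_params[name]) ** 2).sum()
                          for name, p in model.named_parameters() if p.requires_grad)

loss = task_loss + l2_sp_penalty(model)   # add to the usual task loss each step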
When NOT to use transfer learning?
- The target domain is extremely different (e.g., medical imaging vs. natural images), although partial transfer of low-level features can still help.
- You have a massive dataset (millions of examples), so training from scratch may be viable.
- Latency constraints require a tiny model, though you can still distill from a larger pretrained teacher.
Parameter-Efficient Transfer Learning
For huge models (LLMs), full fine-tuning is expensive. Adapters and LoRA inject small trainable modules.
LoRA
Low-rank adaptation: trainable low-rank matrices A·B are added in parallel to the frozen weights, and only A and B are updated.
Trains roughly 1% of the parameters compared with full GPT-3 fine-tuning.
Adapter
Small bottleneck MLP inserted between transformer layers.
Prefix Tuning
Learn virtual tokens prepended to input. Keeps base model frozen.
# LoRA with Hugging Face PEFT
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=32,
    target_modules=["q", "v"],  # T5 attention projections; LLaMA-style models use "q_proj"/"v_proj"
    lora_dropout=0.1
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # reports the small trainable-parameter fraction
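Prefix tuning follows the same PEFT pattern; a minimal sketch, assuming the same t5-large base, where only the learned virtual-token prefix is trained.

from peft import PrefixTuningConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

prefix_config = PrefixTuningConfig(
    task_type="SEQ_2_SEQ_LM",   # matches the seq2seq base model
    num_virtual_tokens=20       # length of the learned prefix
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()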
Transfer Learning Strategies Comparison
| Strategy | Frozen Layers | Data Required | Training Time | Accuracy | Use Case |
|---|---|---|---|---|---|
| Feature extraction | All except head | Very low | Fast | Good | Small dataset, similar domain |
| Partial fine-tune (top block) | Early layers | Low | Medium | Better | Medium dataset, some domain shift |
| Full fine-tuning | None | High | Slow | Best | Large dataset, significant shift |
| Gradual unfreezing | Dynamic | Medium | Slow | SOTA | NLP, ULMFiT style |
| Adapter/LoRA | All (frozen) | Low | Fast | Near SOTA | LLMs, multi-task |
Real-World Transfer Learning
Medical Imaging
CheXNet: DenseNet-121 pretrained on ImageNet, fine-tuned on ChestX-ray14. Radiologist-level pneumonia detection with only 100k images.
Autonomous Driving
Pretrained on Cityscapes, fine-tuned on specific camera setups. 90% less data needed.
Sentiment Analysis
BERT fine-tuned on 10k labeled reviews matches LSTM trained on 1M+ reviews.
Robotics
Sim2Real: train in simulation (source), fine-tune on real robot (target).
Deploying Transfer Learning Models
Pretrained models are large. Optimize for production.
Quantization: INT8 reduces model size ~4x and speeds up inference (see the sketch at the end of this section).
Pruning: remove unimportant weights; 50%+ compression is typical.
Knowledge distillation: train a small student model on the teacher's soft predictions.
import torch
import torch.nn.functional as F

# Distillation: ResNet-50 teacher -> ResNet-18 student, trained on soft labels
teacher = models.resnet50(pretrained=True).eval()
student = models.resnet18()
temperature = 4.0
with torch.no_grad():                      # teacher only provides soft labels
    soft_labels = F.softmax(teacher(images) / temperature, dim=1)
student_logits = student(images)
# KL divergence to the softened teacher distribution, scaled by T^2
student_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                        soft_labels, reduction='batchmean') * temperature ** 2
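To illustrate the quantization bullet above, a minimal post-training sketch using PyTorch dynamic INT8 quantization, assuming a trained `model`; dynamic quantization converts nn.Linear layers, which suits transformer-style models, while CNN backbones usually need static quantization instead.

import torch
import torch.nn as nn

# Dynamic INT8 quantization of the linear layers of a trained model
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)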