Transfer Learning: Stand on the Shoulders of Giants
Why train from scratch when you can leverage models trained on massive datasets? Transfer learning adapts pretrained knowledge to your specific task, yielding faster training, lower data requirements, and often state-of-the-art performance.
- ImageNet: 14M images, 1,000 classes
- BERT: pretrained on 3.3B words
- ~10x training speedup is typical
- SOTA results reported on 95%+ of tasks
What is Transfer Learning?
Transfer learning = knowledge transfer from a source task (large dataset) to a target task (your specific problem). Instead of random initialization, start with weights learned on ImageNet, BERT, or GPT.
Low-level features (edges, textures) transfer broadly across domains; high-level, task-specific features need adaptation.
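To make the starting point concrete, a minimal PyTorch sketch (assuming torchvision >= 0.13) contrasting random initialization with ImageNet-pretrained initialization:

import torchvision.models as models

# Random initialization: training starts from scratch
scratch_model = models.resnet50(weights=None)
# Transfer learning: start from weights learned on ImageNet
pretrained_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)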
Feature Extraction vs Fine-Tuning
Feature Extraction
Freeze the pretrained backbone (no gradient updates); train only the new classifier head.
- Treat model as fixed feature extractor
- Fast, low memory
- Works when target is similar to source
- When: Small dataset, limited compute
Fine-Tuning
Unfreeze some/all layers and continue training on target data.
- Adapts features to target domain
- Higher accuracy if enough data
- Risk of overfitting with tiny datasets
- When: Larger dataset, domain shift
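The two regimes differ only in which parameters receive gradients. A minimal PyTorch sketch (the helper name set_transfer_mode is ours, not a library function):

def set_transfer_mode(model, head, fine_tune=False):
    """Feature extraction when fine_tune=False; full fine-tuning when True."""
    for p in model.parameters():
        p.requires_grad = fine_tune  # backbone frozen unless fine-tuning
    for p in head.parameters():
        p.requires_grad = True       # the new head always trains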
Computer Vision – PyTorch & TensorFlow
TorchVision and Keras Applications provide dozens of ImageNet-pretrained models.
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet50 (torchvision >= 0.13; use pretrained=True on older versions)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head (new task: 10 classes)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 10)  # new head has requires_grad=True by default

# Train only the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

# For fine-tuning: unfreeze the last residual block later
for param in model.layer4.parameters():
    param.requires_grad = True
import tensorflow as tf

# Load pretrained backbone
base_model = tf.keras.applications.EfficientNetB0(
    include_top=False,           # drop the ImageNet classifier
    weights='imagenet',
    input_shape=(224, 224, 3),
    pooling='avg',               # global average pooling
)

# Freeze the base model
base_model.trainable = False

# Add a new classifier head
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Fine-tuning later: unfreeze, but keep the first 100 layers frozen
base_model.trainable = True
for layer in base_model.layers[:100]:
    layer.trainable = False
# Remember to recompile after changing trainable flags
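Usage sketch: compile and fit as usual (train_ds and val_ds are assumed tf.data datasets; values are illustrative). After changing trainable flags, Keras requires recompiling for the change to take effect:

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=5)

# After base_model.trainable = True: recompile with a much lower LR
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])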
NLP Transfer Learning – BERT & Friends
The Hugging Face transformers library is the industry standard for NLP transfer learning.
BERT Fine-Tuning
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=3,  # e.g., three sentiment classes
)

# Freeze the embeddings (optional)
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Fine-tune all encoder layers with discriminative learning rates:
# lower LR for early layers, higher for later layers and the new head
optimizer = torch.optim.AdamW([
    {'params': model.bert.encoder.layer[:6].parameters(), 'lr': 1e-5},
    {'params': model.bert.encoder.layer[6:].parameters(), 'lr': 2e-5},
    {'params': model.classifier.parameters(), 'lr': 3e-5},
])
GPT for Text Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Fine-tune on a domain-specific corpus with the standard causal LM loss,
# or skip fine-tuning entirely and use prompt-based / few-shot learning
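A quick generation sketch with the loaded GPT-2 (the prompt and sampling parameters are illustrative):

inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50,
                         do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))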
For very large language models, parameter-efficient adaptation (LoRA, prefix tuning, adapters) replaces full fine-tuning; see the section below.
Tip: AutoModelForSequenceClassification and the Trainer API make fine-tuning largely boilerplate-free.
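A minimal Trainer sketch (train_ds and val_ds are assumed pre-tokenized datasets; hyperparameters are illustrative):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
args = TrainingArguments(output_dir='out', num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()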
Freezing Strategies & Gradual Unfreezing
Not all layers transfer equally. Early layers learn general features; later layers are task-specific.
(Layer states: frozen · fine-tune at low LR · train from scratch)
Option 1: Only train head
Dataset: 100-1000 images. Fast, minimal overfitting.
Option 2: Unfreeze top block
Dataset: 1k-10k images. Gradually unfreeze from top.
Option 3: Full fine-tuning
Dataset: 10k+ images. Use a ~10x lower LR than training from scratch.
# Phase 1: train only the classifier head (backbone frozen)
for epoch in range(5):
    train_one_epoch(model, optimizer)  # your standard training loop (not shown)

# Phase 2: unfreeze the last layer group and give it a lower LR
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-4},
])

# Phase 3: unfreeze everything and drop the global LR
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
Discriminative Learning Rates
Different layers need different learning rates. Lower LR for pretrained base, higher for new head.
Why?
Pretrained weights are already good, so they need only small updates; the new head is randomly initialized and needs larger steps.
Rule of thumb: LR_base = LR_head / 10 to LR_head / 100.
# PyTorch – per-parameter-group LRs
optimizer = torch.optim.AdamW([
    {'params': model.conv1.parameters(), 'lr': 1e-6},   # earliest layers: tiny steps
    {'params': model.layer4.parameters(), 'lr': 5e-6},  # later layers: slightly larger
    {'params': model.fc.parameters(), 'lr': 1e-4},      # new head: largest
])

# fast.ai – slice spreads LRs across layer groups
learn.fit_one_cycle(5, slice(1e-6, 1e-4))

# TensorFlow – Keras optimizers have no built-in per-layer LR;
# use multiple optimizers in a custom train step (see sketch below)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
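One workaround in TensorFlow is two optimizers inside a custom train step. A sketch, assuming base_model.trainable = True and that loss_fn, x, and y are defined elsewhere:

base_opt = tf.keras.optimizers.Adam(1e-5)   # pretrained backbone: small steps
head_opt = tf.keras.optimizers.Adam(1e-4)   # new head: larger steps
base_vars = base_model.trainable_variables
head_vars = [v for v in model.trainable_variables
             if all(v is not b for b in base_vars)]

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
grads = tape.gradient(loss, base_vars + head_vars)
base_opt.apply_gradients(zip(grads[:len(base_vars)], base_vars))
head_opt.apply_gradients(zip(grads[len(base_vars):], head_vars))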
Domain Shift & Catastrophic Forgetting
Catastrophic forgetting: aggressive fine-tuning overwrites pretrained knowledge. Solution: lower LR, freeze early layers, regularization (EWC, L2-SP), as sketched below.
Domain shift: the target distribution differs from the source. Solution: fine-tune more layers, use adversarial domain adaptation, or train from scratch if the domains are too different.
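For instance, an L2-SP-style penalty anchors weights to their pretrained values so fine-tuning cannot drift too far. A sketch (task_loss is assumed; snapshot the weights after replacing the head, before fine-tuning):

# Snapshot the pretrained weights once
anchor = {n: p.detach().clone() for n, p in model.named_parameters()}

def l2_sp_penalty(model, strength=1e-3):
    # Penalize squared distance from the pretrained starting point
    return strength * sum(((p - anchor[n]) ** 2).sum()
                          for n, p in model.named_parameters())

loss = task_loss + l2_sp_penalty(model)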
When NOT to use transfer learning?
- The target domain is extremely different (e.g., medical imaging vs. natural images); even then, pretrained low-level features often help.
- You have a massive dataset (millions of examples), so training from scratch is viable.
- Latency constraints require tiny models; even then, you can distill from a larger pretrained teacher.
Parameter-Efficient Transfer Learning
For huge models (LLMs), full fine-tuning is expensive. Adapters and LoRA inject small trainable modules.
LoRA
Low-rank adaptation: trainable low-rank matrices A·B run in parallel to the frozen weights; only these are trained.
Reported to match full GPT-3 fine-tuning quality while training under 1% of the parameters.
Adapter
Small bottleneck MLP inserted between transformer layers.
Prefix Tuning
Learn virtual tokens prepended to input. Keeps base model frozen.
# LoRA with Hugging Face PEFT
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

lora_config = LoraConfig(
    r=8,                        # rank of the A·B update matrices
    lora_alpha=32,              # scaling factor
    target_modules=["q", "v"],  # T5's query/value projections
    lora_dropout=0.1,
    task_type=TaskType.SEQ_2_SEQ_LM,
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
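Prefix tuning follows the same PEFT workflow; a sketch (values are illustrative):

from peft import PrefixTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
prefix_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,  # learned virtual tokens prepended to every input
)
model = get_peft_model(model, prefix_config)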
Transfer Learning Strategies Comparison
| Strategy | Frozen Layers | Data Required | Training Time | Accuracy | Use Case |
|---|---|---|---|---|---|
| Feature Extraction | All except head | ⭐ Very low | ⚡ Fast | 📊 Good | Small dataset, similar domain |
| Partial fine-tune (top block) | Early layers | ⭐⭐ Low | ⚡⚡ Medium | 📊📊 Better | Medium dataset, some domain shift |
| Full fine-tuning | None | ⭐⭐⭐ High | ⚡⚡⚡ Slow | 📊📊📊 Best | Large dataset, significant shift |
| Gradual unfreezing | Dynamic | ⭐⭐ Medium | ⚡⚡⚡ Slow | 📊📊📊 SOTA | NLP, ULMFiT style |
| Adapter/LoRA | All (frozen) | ⭐⭐ Low | ⚡ Fast | 📊📊 Near SOTA | LLMs, multi-task |
Real-World Transfer Learning
Medical Imaging
CheXNet: DenseNet-121 pretrained on ImageNet, fine-tuned on ChestX-ray14. Radiologist-level pneumonia detection with only 100k images.
Autonomous Driving
Models pretrained on Cityscapes and fine-tuned on specific camera setups can reportedly need ~90% less labeled data.
Sentiment Analysis
BERT fine-tuned on 10k labeled reviews matches LSTM trained on 1M+ reviews.
Robotics
Sim2Real: train in simulation (source), fine-tune on real robot (target).
Deploying Transfer Learning Models
Pretrained models are large. Optimize for production.
- Quantization: INT8 reduces model size ~4x and speeds up inference.
- Pruning: remove unimportant weights for 50%+ compression.
- Distillation: train a small student on the teacher's predictions.
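Distillation is sketched in the next block; for quantization, PyTorch dynamic quantization is a one-liner (a minimal sketch; model is assumed to be a trained module):

import torch
import torch.nn as nn

# Dynamic quantization: Linear weights stored as INT8,
# dequantized on the fly during inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)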
# Distillation example: train a small student on the teacher's soft labels
import torch
import torch.nn.functional as F
teacher = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
student = models.resnet18()
temperature = 4.0  # softens the teacher's output distribution
with torch.no_grad():                       # teacher stays frozen
    soft_labels = F.softmax(teacher(images) / temperature, dim=1)
student_logits = student(images)            # images: an input batch (assumed)
# KL divergence between softened distributions, scaled by T^2 (Hinton et al.)
student_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                        soft_labels, reduction='batchmean') * temperature ** 2