
Transfer Learning: Stand on the Shoulders of Giants

Why train from scratch when you can leverage models trained on massive datasets? Transfer learning enables faster training, less data, and state-of-the-art performance by adapting pretrained knowledge to your specific task.

  • ImageNet: 14M images, 1,000 classes
  • BERT: pretrained on 3.3B words
  • 10x: typical training speedup vs. training from scratch
  • SOTA: state-of-the-art results on 95%+ of tasks

What is Transfer Learning?

Transfer learning = knowledge transfer from a source task (large dataset) to a target task (your specific problem). Instead of random initialization, start with weights learned on ImageNet, BERT, or GPT.

[Diagram: ImageNet pretrained weights (source) → pneumonia X-ray classifier (target)]

Low-level features (edges, textures) transfer universally. High-level features need adaptation.

✓ Requires less data
✓ Faster convergence
✓ Better generalization
✓ SOTA even with small datasets

Feature Extraction vs Fine-Tuning

Feature Extraction

Freeze pretrained backbone – no gradient updates. Only train new classifier head.

  • Treat model as fixed feature extractor
  • Fast, low memory
  • Works when target is similar to source
  • When: Small dataset, limited compute
Fine-Tuning

Unfreeze some/all layers and continue training on target data.

  • Adapts features to target domain
  • Higher accuracy if enough data
  • Risk of overfitting with tiny datasets
  • When: Larger dataset, domain shift
Best practice: Start with feature extraction, then gradually unfreeze and fine-tune with low learning rate.

Computer Vision – PyTorch & TensorFlow

TorchVision and Keras Applications provide dozens of ImageNet-pretrained models.

PyTorch – ResNet50 Feature Extraction
import torch
import torch.nn as nn
import torchvision.models as models

# Load ImageNet-pretrained ResNet50 (weights= replaces the deprecated pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head (new task: 10 classes); the new layer defaults to requires_grad=True
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 10)

# Train only the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

# For fine-tuning: unfreeze later
for param in model.layer4.parameters():
    param.requires_grad = True
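A quick sanity check after freezing is to count trainable parameters. A minimal sketch on a toy two-layer stand-in model (the same two `sum(...)` lines apply unchanged to the ResNet above):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))  # stand-in backbone + head
for p in model[0].parameters():
    p.requires_grad = False  # freeze the "backbone"

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total}")  # only the head's weights and bias remain
```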
TensorFlow – EfficientNetB0
import tensorflow as tf

# Load pretrained backbone
base_model = tf.keras.applications.EfficientNetB0(
    include_top=False,  # drop classifier
    weights='imagenet',
    input_shape=(224, 224, 3),
    pooling='avg'  # global average pooling
)

# Freeze base model
base_model.trainable = False

# Add new classifier
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Fine-tuning later: unfreeze, keep early layers frozen,
# then recompile so the trainable change takes effect
base_model.trainable = True
for layer in base_model.layers[:100]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='sparse_categorical_crossentropy')
Popular backbones: ResNet, VGG, DenseNet, EfficientNet, MobileNet, ViT

NLP Transfer Learning – BERT & Friends

Hugging Face transformers library is the industry standard for NLP transfer learning.

BERT Fine-Tuning
import torch
from transformers import BertForSequenceClassification, Trainer

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=3  # sentiment classes
)

# Freeze embeddings (optional)
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Fine-tune all encoder layers
# Discriminative learning rates
optimizer = torch.optim.AdamW([
    {'params': model.bert.encoder.layer[:6].parameters(), 'lr': 1e-5},
    {'params': model.bert.encoder.layer[6:].parameters(), 'lr': 2e-5},
    {'params': model.classifier.parameters(), 'lr': 3e-5}
])
GPT for Text Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Fine-tune on domain-specific corpus
# Use causal LM loss
# Prompt-based learning / few-shot

LLM adaptation: LoRA, prefix tuning, adapters

Pro tip: Use AutoModelForSequenceClassification and Trainer API for seamless fine-tuning.

Freezing Strategies & Gradual Unfreezing

Not all layers transfer equally. Early layers learn general features, later layers are task-specific.

[Layer diagram: Conv1 → Conv2 → Conv3 → Conv4 → FC; legend: ■ frozen, ■ fine-tune (low LR), ■ train from scratch]

Option 1: Only train head

Dataset: 100-1000 images. Fast, minimal overfitting.

Option 2: Unfreeze top block

Dataset: 1k-10k images. Gradually unfreeze from top.

Option 3: Full fine-tuning

Dataset: 10k+ images. Use 10x lower LR than scratch.
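The three options above reduce to a rule of thumb on dataset size. A sketch using the thresholds quoted above (heuristics, not hard limits):

```python
def freezing_strategy(n_images):
    """Map dataset size to a freezing strategy, using the thresholds above."""
    if n_images < 1_000:
        return "train head only"
    if n_images < 10_000:
        return "unfreeze top block"
    return "full fine-tuning (LR ~10x lower than from scratch)"

print(freezing_strategy(500))    # train head only
print(freezing_strategy(50_000)) # full fine-tuning (LR ~10x lower than from scratch)
```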

Gradual Unfreezing (ULMFiT style)
# Phase 1: train only the classifier head (backbone frozen)
for epoch in range(5):
    train_one_epoch(model)  # hypothetical loop; only head params have requires_grad=True

# Phase 2: unfreeze last layer group, discriminative LRs
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-4}
])

# Phase 3: Unfreeze all, lower LR
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

Discriminative Learning Rates

Different layers need different learning rates. Lower LR for pretrained base, higher for new head.

Why?

Pretrained weights already good → small updates. New head random → larger steps.

Rule of thumb: LR_base ≈ LR_head / 10 to LR_head / 100

# PyTorch – per-parameter LR
optimizer = torch.optim.AdamW([
    {'params': model.conv1.parameters(), 'lr': 1e-6},
    {'params': model.conv2.parameters(), 'lr': 5e-6},
    {'params': model.fc.parameters(), 'lr': 1e-4}
])

# fast.ai – slice
learn.fit_one_cycle(5, slice(1e-6, 1e-4))

# TensorFlow – different LR per layer
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
# Apply different LR via gradient tape
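The per-layer values above can also be generated rather than hand-picked. A sketch of a geometric LR ladder (the 2.6 decay factor is the ULMFiT heuristic, assumed here):

```python
def layer_lrs(lr_head, n_layers, decay=2.6):
    """Geometric LR ladder: each earlier layer's LR is the next layer's divided by `decay`."""
    return [lr_head / decay ** (n_layers - 1 - i) for i in range(n_layers)]

lrs = layer_lrs(1e-4, 4)
# earliest layer gets the smallest step, the head the largest
```

The resulting list plugs directly into the per-parameter-group optimizer shown above.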

Domain Shift & Catastrophic Forgetting

⚠️ Catastrophic Forgetting: Fine-tuning on new task overwrites pretrained knowledge.

Solution: Lower LR, freeze early layers, regularization (EWC, L2).

✅ Domain Adaptation: When source (ImageNet) and target (medical) are very different.

Solution: Fine-tune more layers, use adversarial DA, or train from scratch if too different.

When NOT to use transfer learning?
  • Target domain is extremely different (e.g., medical imaging vs natural images) — partial transfer still helps low-level features.
  • You have massive dataset (millions) — training from scratch may be viable.
  • Latency constraints require tiny models — but you can distill from larger pretrained.

Parameter-Efficient Transfer Learning

For huge models (LLMs), full fine-tuning is expensive. Adapters and LoRA inject small trainable modules.

LoRA

Low-rank adaptation: A·B matrices parallel to weights. Only train these.

Trains well under 1% of parameters, yet matches full fine-tuning quality on GPT-3-scale models

Adapter

Small bottleneck MLP inserted between transformer layers.

Prefix Tuning

Learn virtual tokens prepended to input. Keeps base model frozen.
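To see why these methods are parameter-efficient, count LoRA's trainable weights for a single d×d matrix replaced by W + B·A (a back-of-the-envelope sketch; the 4096 hidden size is an assumed example):

```python
def lora_trainable_fraction(d_model, rank):
    """Trainable fraction for one d×d weight adapted as W + B @ A,
    where A is (rank × d) and B is (d × rank)."""
    return (2 * d_model * rank) / (d_model * d_model)

frac = lora_trainable_fraction(4096, 8)  # assumed hidden size 4096, r=8
print(f"{frac:.2%}")  # well under 1% of the full matrix
```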

# LoRA with Hugging Face PEFT
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,  # rank of the low-rank update
    lora_alpha=32,
    target_modules=["q", "v"],  # T5's attention projection modules
    lora_dropout=0.1
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters

Transfer Learning Strategies Comparison

Strategy | Frozen layers | Data required | Training time | Accuracy | Use case
Feature extraction | All except head | Very low | Fast | Good | Small dataset, similar domain
Partial fine-tune (top block) | Early layers | Low | Medium | Better | Medium dataset, some domain shift
Full fine-tuning | None | High | Slow | Best | Large dataset, significant shift
Gradual unfreezing | Dynamic | Medium | Slow | SOTA | NLP, ULMFiT style
Adapter/LoRA | All (frozen) | Low | Fast | Near-SOTA | LLMs, multi-task

Real-World Transfer Learning

Medical Imaging

CheXNet: DenseNet-121 pretrained on ImageNet, fine-tuned on ChestX-ray14. Radiologist-level pneumonia detection with only 100k images.

Autonomous Driving

Pretrained on Cityscapes, fine-tuned on specific camera setups. 90% less data needed.

Sentiment Analysis

BERT fine-tuned on 10k labeled reviews matches LSTM trained on 1M+ reviews.

Robotics

Sim2Real: train in simulation (source), fine-tune on real robot (target).

Deploying Transfer Learning Models

Pretrained models are large. Optimize for production.

Quantization

INT8 reduces size 4x, speeds up inference.

Pruning

Remove unimportant weights. 50%+ compression.

Knowledge Distillation

Train small student on teacher's predictions.
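Of the three, quantization is easy to sketch from first principles. A toy symmetric per-tensor INT8 quantizer (illustrative only, not a production or library API):

```python
def quantize_int8(xs):
    """Toy symmetric INT8 quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) for x in xs], scale

def dequantize_int8(qs, scale):
    """Recover approximate floats from INT8 values and the shared scale."""
    return [q * scale for q in qs]

qs, scale = quantize_int8([0.5, -1.0, 0.25])
xs = dequantize_int8(qs, scale)  # close to the originals, within half a scale step
```

Each value now fits in one byte instead of four, which is where the 4x size reduction comes from; real toolchains add per-channel scales and calibration.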

# Distillation example
import torch
import torch.nn.functional as F
import torchvision.models as models

teacher = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
student = models.resnet18()

# Train student on the teacher's softened predictions
temperature = 4.0
with torch.no_grad():
    soft_labels = F.softmax(teacher(images) / temperature, dim=1)
student_logits = student(images)
student_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=1),
    soft_labels, reduction='batchmean'
) * temperature ** 2

Transfer Learning Cheatsheet

pretrained: ImageNet, BERT weights as the starting point
freeze: requires_grad=False
fine-tune: unfreeze + low LR
discriminative LR: layer-wise learning rates
gradual unfreezing: ULMFiT-style phases
LoRA: low-rank adaptation
head: new task-specific classifier
catastrophic forgetting: mitigate with regularization