
Transfer Learning: 20 Interview Questions

Master fine-tuning, feature extraction, domain adaptation, catastrophic forgetting, and pretrained models (ImageNet, BERT, GPT). Interview-ready answers with strategies and trade-offs.

Topics: Feature Extraction, Fine-Tuning, Domain Adaptation, Catastrophic Forgetting, Pretrained Models, Few-Shot
1 What is transfer learning? Why is it important? ⚡ Easy
Answer: Transfer learning leverages knowledge learned from a source task (usually large dataset) to improve learning on a target task (usually smaller dataset). Critical for data efficiency, faster convergence, and better performance when labeled data is scarce.
"Standing on the shoulders of giants" – reuse features from models trained on massive datasets (ImageNet, Wikipedia).
2 Types of transfer learning: inductive, transductive, unsupervised? 🔥 Hard
Answer:
  • Inductive: source and target tasks different, domains same or different. Fine-tuning is inductive.
  • Transductive: tasks same, domains different (domain adaptation).
  • Unsupervised: no labeled data in either task (e.g., self-supervised pretraining).
3 Feature extraction (frozen backbone) vs fine-tuning – trade-offs? 📊 Medium
Answer: Feature extraction: freeze pretrained weights, train only new head. Faster, less overfitting, good for small data/similar domains. Fine-tuning: unfreeze some/all layers, train end-to-end. Better for large data/different domains, risk of catastrophic forgetting.
  • Feature extraction: low data, similar task, cheap
  • Fine-tuning: more data, dissimilar task, compute available
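A minimal PyTorch sketch of the feature-extraction setup, assuming a torchvision ResNet-50 backbone and a hypothetical num_classes for the target task; for fine-tuning, skip (or later undo) the freezing loop and use a small learning rate:
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical number of target-task classes

model = models.resnet50(weights="DEFAULT")                 # ImageNet-pretrained backbone
for p in model.parameters():
    p.requires_grad = False                                # freeze all pretrained weights
model.fc = nn.Linear(model.fc.in_features, num_classes)    # new head, trainable by default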
4 How to decide which layers to freeze/unfreeze? 🔥 Hard
Answer: Lower layers learn generic features (edges, textures); higher layers learn task-specific ones. Freeze lower layers if the target dataset is small or similar to the source. Unfreeze gradually from the top down (gradual unfreezing), often paired with layer-wise (discriminative) learning rates. The larger the domain shift, the more layers to unfreeze.
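Continuing the ResNet sketch above, a hedged example of unfreezing only the top block plus the head (assuming torchvision's layer4/fc naming):
for p in model.parameters():
    p.requires_grad = False          # freeze everything
for p in model.layer4.parameters():
    p.requires_grad = True           # top block: most task-specific features
for p in model.fc.parameters():
    p.requires_grad = True           # new classification head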
5 What is catastrophic forgetting? How to prevent it? 🔥 Hard
Answer: Fine-tuning on a new task can overwrite knowledge the network learned previously. Mitigation: lower learning rates, freeze early layers, elastic weight consolidation (EWC), Learning without Forgetting (LwF), replay buffers, gradual unfreezing.
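A minimal sketch of one mitigation: penalize drift away from the pretrained weights while fine-tuning. True EWC weights this penalty per parameter by the Fisher information; the uniform weight lam below is a simplification, and model / task_loss are assumed to exist:
pretrained = {n: p.clone().detach() for n, p in model.named_parameters()}

def drift_penalty(model, lam=1e-3):
    # quadratic penalty pulling fine-tuned weights back toward the pretrained ones
    return lam * sum(((p - pretrained[n]) ** 2).sum()
                     for n, p in model.named_parameters())

loss = task_loss + drift_penalty(model)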
6 What is domain adaptation? When needed? 🔥 Hard
Answer: Domain adaptation addresses distribution shift between source and target domains (same task, different data distributions). Approaches: adversarial domain adaptation (gradient reversal), CORAL, self-training, data alignment. Used in sim-to-real, cross-lingual transfer.
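A sketch of the gradient-reversal trick used in adversarial domain adaptation (DANN-style); features and domain_classifier are assumed to come from the rest of the model:
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# The domain classifier sees the features unchanged, but its gradient is reversed,
# pushing the encoder toward domain-invariant representations.
domain_logits = domain_classifier(GradReverse.apply(features, 1.0))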
7 Criteria for selecting a pretrained model? 📊 Medium
Answer: Source domain similarity to target, dataset size (ImageNet vs JFT), architecture efficiency, performance on relevant benchmarks, input size compatibility, license, and framework support.
8 What is self-supervised learning for transfer? Examples? 🔥 Hard
Answer: Pretrain on unlabeled data via pretext tasks (contrastive learning, masking). SimCLR: maximize agreement between augmented views of the same image. MAE: reconstruct masked image patches. BERT: masked language modeling. The learned representations transfer well to downstream tasks and reduce the need for labeled data.
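A compact sketch of a SimCLR-style NT-Xent contrastive loss, assuming z1 and z2 are projection-head outputs for two augmented views of the same batch:
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d) unit-norm embeddings
    sim = z @ z.t() / temperature                           # pairwise cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))              # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)  # positive = other view
    return F.cross_entropy(sim, targets)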
9 How does transfer learning enable few-shot learning? 📊 Medium
Answer: Pretrained models provide strong feature extractors. With a good feature space, a simple classifier (linear probe, prototype) can generalize from few examples. Meta-learning also builds on transferable representations.
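A hedged sketch of a few-shot linear probe: frozen pretrained features plus a simple classifier. support_x (a small batch of preprocessed images) and support_y (its labels) are hypothetical:
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegression

backbone = models.resnet50(weights="DEFAULT")
backbone.fc = torch.nn.Identity()        # drop the ImageNet head, keep 2048-d features
backbone.eval()

with torch.no_grad():
    feats = backbone(support_x).numpy()  # frozen features for the few labeled examples

probe = LogisticRegression(max_iter=1000).fit(feats, support_y)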
10 What is negative transfer? How to detect/avoid? 🔥 Hard
Answer: When transferring knowledge harms target performance vs training from scratch. Causes: source task too dissimilar, misleading features, domain shift. Detect via validation performance. Avoid by careful model selection, layer freezing, regularization, or using smaller learning rates.
11 Explain progressive resizing and gradual unfreezing. 📊 Medium
Answer: Progressive resizing: start training at a smaller image size and increase it over time; speeds up early epochs and acts as a mild curriculum. Gradual unfreezing: freeze everything but the head at first, then progressively unfreeze layers from the top down during training. Both make fine-tuning faster and more stable.
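A sketch of a gradual-unfreezing schedule, assuming ResNet-style block names and a hypothetical train_one_epoch helper (model and num_epochs are also assumed):
blocks = [model.layer4, model.layer3, model.layer2, model.layer1]  # top-down order

for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True                  # the new head is always trainable

for epoch in range(num_epochs):
    if epoch < len(blocks):
        for p in blocks[epoch].parameters():
            p.requires_grad = True          # unfreeze the next-highest block
    train_one_epoch(model)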
12 What are discriminative learning rates? 🔥 Hard
Answer: Using different learning rates for different layers: lower LR for early layers (preserve generic features), higher LR for later layers (adapt task-specific features). Implemented via parameter groups in the optimizer; introduced in ULMFiT.
from torch.optim import Adam

# Lower learning rate for the pretrained backbone, higher for the new head
optimizer = Adam([
    {'params': model.base.parameters(), 'lr': 1e-5},        # preserve generic features
    {'params': model.classifier.parameters(), 'lr': 1e-3},  # adapt the task-specific head
])
13 How is transfer learning different in NLP vs vision? 📊 Medium
Answer: NLP: pretrain with language modeling on large text corpora, then fine-tune on downstream tasks; BERT is bidirectional (masked LM), GPT is causal. The entire model is usually fine-tuned. Vision: historically freeze early layers and train a new classifier head. Both fields now trend toward full fine-tuning (or parameter-efficient variants).
14 Transfer learning vs multitask learning? 📊 Medium
Answer: Transfer: sequential (source → target). Multitask: simultaneous learning of multiple tasks, sharing representations. Transfer focuses on target; multitask aims to improve all tasks via shared inductive bias.
15 Metrics to measure transfer learning success? 📊 Medium
Answer: Target task accuracy/AUC, convergence speed (epochs to target performance), data efficiency (performance vs training size), negative transfer detection (compare to scratch baseline), and transfer ratio.
16 When would you NOT use transfer learning? 📊 Medium
Answer: Target domain extremely different (medical images vs natural images), very large target dataset available, custom architecture not supported by pretrained models, or when pretrained model has biased/unsafe features.
17 What are adapters? Why use them? 🔥 Hard
Answer: Adapters are small trainable modules inserted between frozen pretrained layers (Houlsby adapters in NLP). They are parameter-efficient, enable multi-task serving from one base model, and avoid catastrophic forgetting since the backbone stays frozen. Related parameter-efficient fine-tuning methods: LoRA (low-rank adaptation), prefix tuning.
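A minimal Houlsby-style adapter sketch: a bottleneck MLP with a residual connection, inserted after a frozen sublayer; only the adapter's parameters are trained:
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)    # down-project
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)      # up-project back to model width

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))    # residual keeps the frozen path intact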
18 Explain LoRA. Why is it popular? 🔥 Hard
Answer: LoRA injects trainable low-rank matrices into attention (and optionally other) layers, approximating the full weight update. It adds no inference latency once BA is merged into W, sharply reduces trainable parameters and optimizer memory, and often matches full fine-tuning. Widely used for LLMs.
W' = W + BA, where B∈R^{d×r}, A∈R^{r×k}, r << min(d,k)
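A hedged sketch of a LoRA-wrapped linear layer matching the formula above: the base weight stays frozen, only A and B are trained, and B is zero-initialized so training starts from W' = W:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())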
19 How does cross-lingual transfer work? (mBERT, XLM-R) 🔥 Hard
Answer: Multilingual models pretrained on concatenated corpora from many languages (shared vocabulary, aligned representations). Fine-tune on high-resource language, zero-shot transfer to low-resource languages. Relies on shared subword units and contextualization.
20 Challenges of transfer learning in RL? 🔥 Hard
Answer: Different dynamics, reward functions, observation spaces. Common approaches: policy transfer, value function transfer, feature reuse. Challenges: negative transfer, catastrophic forgetting, exploration-exploitation. Sim-to-real via domain randomization.

Transfer Learning – Interview Cheat Sheet

Fine-Tuning Strategies
  • Linear probe: train the classifier only
  • Full fine-tune: all layers
  • Gradual unfreeze: top-down
  • Discriminative LR: layer-wise rates
  • LoRA/adapters: parameter-efficient
Risks
  • Catastrophic forgetting: EWC, replay
  • Negative transfer: validation check vs scratch baseline
  • Domain shift: adaptation
Popular Pretrained Models
  • ImageNet: ResNet, EfficientNet, ViT
  • BERT: NLP understanding
  • GPT: generation
  • CLIP: vision-language
  • Wav2Vec: speech
When Transfer Works Best
  • Small target dataset
  • Similar low-level features
  • Compute constraints
  • Strong source task performance

Verdict: "Transfer learning is the default for modern DL – know when to freeze, when to adapt."