Deep Learning Regularization: 20 Interview Questions
Master L1/L2 regularization, dropout, batch normalization, data augmentation, and early stopping, plus overfitting, underfitting, and the bias-variance tradeoff – all with concise, interview-ready answers.
1. What is regularization in deep learning? Why is it needed? (⚡ Easy)
Answer: Regularization is any technique that reduces generalization (test) error, possibly at the cost of slightly higher training error. It prevents the model from fitting noise, improving performance on unseen data.
Overfitting ⇨ high variance, low bias; Regularization ⇨ increase bias, reduce variance
2. Explain L2 regularization (weight decay). How does it work? (📊 Medium)
Answer: L2 adds the penalty term (λ/2) Σ w² to the loss. It shrinks weights toward zero, discouraging complex models; in optimizers this is called weight decay. Small weights reduce sensitivity to input noise.
L_total = L_original + (λ/2) * Σ w² ; Gradient update: w = w - η(∂L_original/∂w + λw)
Pros: smooth, differentiable, works well. Cons: doesn't induce sparsity.
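A minimal NumPy sketch of the gradient update above (all values are illustrative):
# One SGD step with an L2 penalty; lam stands in for λ, lr for η
import numpy as np
w = np.array([0.5, -1.2, 3.0])      # current weights
grad = np.array([0.1, -0.2, 0.3])   # ∂L_original/∂w from backprop (made up here)
lr, lam = 0.1, 0.01
w = w - lr * (grad + lam * w)       # the extra -lr*lam*w term shrinks weights each step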
3. How is L1 different from L2? Why does L1 yield sparse weights? (🔥 Hard)
Answer: L1 penalty = λ Σ |w|. Its (sub)gradient has constant magnitude λ·sign(w), so it pushes weights to exactly zero, whereas L2 shrinks proportionally to w. L1 therefore creates sparse solutions (implicit feature selection), because the penalty's pull doesn't vanish near zero.
L1: ∂L/∂w = ∂L_orig/∂w + λ·sign(w) ; L2: + λ·w
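A tiny NumPy sketch showing why the L1 pull persists near zero (values illustrative):
# L1 pushes small weights with constant force; L2's push vanishes near zero
import numpy as np
w = np.array([0.001, -0.5, 2.0])
lam = 0.1
l1_grad = lam * np.sign(w)  # magnitude 0.1 everywhere, even for tiny w -> sparsity
l2_grad = lam * w           # magnitude 0.0001 for tiny w -> no push to exact zero
print(l1_grad, l2_grad)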
4. What is Dropout? Why does it prevent overfitting? (📊 Medium)
Answer: Dropout randomly zeroes neurons during training, keeping each with probability p. It prevents co-adaptation, forces redundant representations, and acts like an ensemble of subnetworks. With classic dropout, activations are scaled by p at test time; with inverted dropout (below), the 1/p scaling happens during training, so inference is unchanged.
# Inverted dropout: keep each unit with probability p, rescale by 1/p at train time
import numpy as np
out = np.random.randn(32, 64)                        # layer activations
p = 0.8                                              # keep probability
mask = np.random.binomial(1, p, size=out.shape) / p  # zeroes ~(1-p) of units
out = out * mask  # training only; at test time, use out unchanged
5. Differentiate Dropout and DropConnect. (🔥 Hard)
Answer: Dropout drops entire neurons (units); DropConnect randomly drops individual connections (weights). DropConnect is more fine-grained but less commonly used.
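A rough sketch of the difference on a single dense layer (shapes and the `keep` rate are illustrative):
# Dropout masks activations; DropConnect masks individual weights
import numpy as np
rng = np.random.default_rng(0)
x, W = rng.normal(size=(4, 8)), rng.normal(size=(8, 16))
keep = 0.5
drop_out  = (x @ W) * (rng.binomial(1, keep, (4, 16)) / keep)   # mask output units
drop_conn = x @ (W * (rng.binomial(1, keep, W.shape) / keep))   # mask weight entries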
6. Does Batch Normalization regularize? How? (📊 Medium)
Answer: Yes, BN adds slight regularization because each mini-batch has different mean/variance, adding noise to hidden activations. It reduces the need for dropout. Primary goal: reduce internal covariate shift, enable higher learning rates.
BN(x) = γ * (x - μ_B)/√(σ_B² + ε) + β
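The formula above in NumPy for a fully connected activation (training-time batch statistics; shapes illustrative):
# Batch norm forward pass over a mini-batch (features on axis 1)
import numpy as np
x = np.random.randn(32, 64)                 # batch of 32 samples, 64 features
gamma, beta, eps = np.ones(64), np.zeros(64), 1e-5
mu, var = x.mean(axis=0), x.var(axis=0)     # per-feature batch statistics
bn = gamma * (x - mu) / np.sqrt(var + eps) + beta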
7. Compare Batch Norm, Layer Norm, Instance Norm. (🔥 Hard)
Answer: BN normalizes across the batch dimension; LN normalizes across the feature dimension (per sample); IN normalizes per channel, per sample. LN is used in RNNs/Transformers; BN in CNNs with large batch sizes; IN is common in style transfer.
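For a conv activation of shape (N, C, H, W), the three variants differ only in which axes the statistics are computed over; a NumPy sketch (assuming NCHW layout, means only):
# Normalization axes for an (N, C, H, W) tensor
import numpy as np
x = np.random.randn(8, 16, 32, 32)               # N, C, H, W
bn_mu = x.mean(axis=(0, 2, 3), keepdims=True)    # BN: over batch + spatial, per channel
ln_mu = x.mean(axis=(1, 2, 3), keepdims=True)    # LN: over all features, per sample
in_mu = x.mean(axis=(2, 3), keepdims=True)       # IN: over spatial only, per sample & channel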
8. Why is data augmentation considered regularization? (⚡ Easy)
Answer: It generates new training samples from existing data via label-preserving transformations, increasing the effective dataset size, reducing overfitting, and improving invariance. Common transforms (two are sketched below):
- cropping
- rotation
- color jitter
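Two of these transforms in plain NumPy (image size and crop size are illustrative):
# Random horizontal flip and random crop on an (H, W, C) image
import numpy as np
img = np.random.rand(32, 32, 3)
if np.random.rand() < 0.5:
    img = img[:, ::-1, :]                   # horizontal flip
top, left = np.random.randint(0, 5, size=2)
img = img[top:top + 28, left:left + 28]     # random 28x28 crop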
9. Explain early stopping as regularization. (📊 Medium)
Answer: Stop training when validation error stops improving. This prevents overfitting by limiting the number of optimization steps; for linear models with quadratic loss it is approximately equivalent to L2 regularization. Use a patience window to avoid stopping prematurely (sketch below).
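A sketch of the patience logic; `val_losses` is a made-up validation curve standing in for real evaluation:
# Early stopping with patience
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]  # hypothetical curve
best, wait, patience = float("inf"), 0, 3
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:
        best, wait = val_loss, 0       # improvement: reset patience counter
    else:
        wait += 1
        if wait >= patience:           # stop after `patience` bad epochs
            print(f"stop at epoch {epoch}, best={best}")
            break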
10. What is label smoothing? Why does it regularize? (🔥 Hard)
Answer: Replace one-hot targets by mixing them with a uniform distribution over the K classes, so the true class gets slightly less than 1 and every other class a small positive value. This prevents overconfidence, improves calibration, and reduces overfitting. Used in modern classifiers (e.g., the original Transformer).
y_smooth = (1-α) * y_onehot + α/K
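The formula above in NumPy (K and α are illustrative):
# Uniform label smoothing: mix one-hot target with a uniform distribution
import numpy as np
K, alpha = 4, 0.1
y_onehot = np.eye(K)[2]                        # true class = 2
y_smooth = (1 - alpha) * y_onehot + alpha / K  # -> [0.025, 0.025, 0.925, 0.025]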
11. How does adding noise (input or weights) regularize? (📊 Medium)
Answer: Gaussian noise added to inputs or weights makes the model robust to small variations, equivalent to a form of Tikhonov regularization. Denoising autoencoders use this.
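A minimal sketch of training-time input noise (σ is illustrative, tuned on validation):
# Add zero-mean Gaussian noise to inputs at training time only
import numpy as np
x = np.random.rand(32, 64)      # a mini-batch of inputs
sigma = 0.1                     # noise scale
x_noisy = x + sigma * np.random.randn(*x.shape)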
12. What is max-norm regularization? Where is it used? (🔥 Hard)
Answer: Constrain each unit's incoming weight vector to ||w||₂ ≤ c. After each update, project weights back to satisfy the constraint (sketch below). Used together with dropout (Hinton et al.) to prevent weights from growing too large.
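A sketch of the projection step, assuming columns of `W` hold each unit's incoming weights (c illustrative):
# Max-norm projection: rescale any weight vector whose L2 norm exceeds c
import numpy as np
W = np.random.randn(64, 128) * 2                   # weights after a gradient step
c = 3.0
norms = np.linalg.norm(W, axis=0, keepdims=True)   # per-unit incoming norms
W = W * np.minimum(1.0, c / norms)                 # shrink only the violating columns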
13. How do you choose the dropout probability? Any heuristics? (📊 Medium)
Answer: Typical drop rates are 0.5 for large fully connected layers and 0.2-0.3 for smaller or convolutional layers. Tune on a validation set: too high → underfitting; too low → little regularization effect.
14. Is weight decay in Adam the same as L2? (AdamW) (🔥 Hard)
Answer: In SGD, the L2 penalty and weight decay are equivalent. In Adam they differ: the adaptive learning rate rescales the L2 penalty's gradient along with everything else. AdamW decouples weight decay from the gradient update (sketch below) and often performs better.
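A deliberately simplified sketch of the contrast: the Adam moment machinery is collapsed into a made-up per-parameter factor `adapt`, so this is illustrative only:
# Naive L2 vs decoupled weight decay (AdamW-style)
import numpy as np
w, grad = np.array([1.0, -2.0]), np.array([0.3, 0.1])
lr, lam = 0.01, 0.1
adapt = np.array([0.5, 2.0])                     # stand-in for 1/(sqrt(v_hat)+eps)
w_l2    = w - lr * adapt * (grad + lam * w)      # decay gets rescaled by adapt
w_adamw = w - lr * adapt * grad - lr * lam * w   # decay applied directly to weights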
15. What is stochastic depth regularization? (🔥 Hard)
Answer: Randomly skip entire residual blocks during training, keeping only the identity path (sketch below). This shortens the expected depth, improves gradient flow, and acts like an ensemble of networks of varying depth. Introduced for very deep ResNets.
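A sketch of the training-time rule for one residual block; `block` is a hypothetical residual branch:
# Stochastic depth: with probability 1 - p_survive, skip the residual branch
import numpy as np
def block(x):                  # hypothetical residual branch
    return 0.1 * x
x, p_survive = np.ones(4), 0.8
if np.random.rand() < p_survive:
    x = x + block(x)           # normal residual update
# else: identity only; at test time use x = x + p_survive * block(x)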
16. Explain Cutout, Mixup, and CutMix regularization. (🔥 Hard)
Answer: Cutout erases a random square region of the input. Mixup takes a convex combination of two images and their labels (sketched below). CutMix cuts a patch from one image, pastes it into another, and mixes the labels in proportion to the patch area. All improve generalization and robustness.
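Mixup in a few lines of NumPy (the Beta parameter and shapes are illustrative):
# Mixup: convex combination of two samples and their one-hot labels
import numpy as np
x1, x2 = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
y1, y2 = np.eye(10)[3], np.eye(10)[7]
lam = np.random.beta(0.2, 0.2)       # mixing coefficient ~ Beta(α, α)
x_mix = lam * x1 + (1 - lam) * x2
y_mix = lam * y1 + (1 - lam) * y2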
17. Does small batch size have a regularization effect? (📊 Medium)
Answer: Yes. Smaller batches yield noisier gradient estimates, which can help the optimizer escape sharp minima and, empirically, generalize better. Very small batches, however, can be computationally inefficient.
18. Compare early stopping and weight decay. (📊 Medium)
Answer: Both reduce effective model capacity: early stopping restricts the number of training iterations, weight decay restricts weight magnitudes. For quadratic losses, stopping after τ steps at learning rate η acts approximately like L2 with λ ∝ 1/(τη); in practice the two are often combined.
19. Can too much regularization cause underfitting? (⚡ Easy)
Answer: Yes. Excessive regularization (very high λ, a high dropout rate, too much augmentation) prevents the model from capturing even the training patterns, increasing bias → underfitting.
20. How does regularization affect bias and variance? (📊 Medium)
Answer: Regularization increases bias (model becomes simpler) but decreases variance (less sensitive to data). Optimal regularization minimizes total test error = bias² + variance + irreducible error.
Bias ↑, Variance ↓ ⇨ lower overfitting
Regularization – Interview Cheat Sheet
Parameter-based
- L1: sparse weights, feature selection
- L2: weight decay, small weights
- Max-norm: constrain weight norm
Layer-based
- Dropout: randomly drop neurons
- Batch Norm: normalize, adds noise
- Stoch. Depth: skip residual blocks
Data-based
- Augmentation: flip, rotate, mixup
- Cutout: erase patches
- Label Smooth: soft labels
Training-based
- Early Stop: halt when validation plateaus
- Noise: input/weight noise
Verdict: "Regularization = bias↑ variance↓. Balance is key!"