
Dropout & Regularization — 15 Interview Questions

Random neuron masks, inverted dropout, train vs eval, plus L1/L2 weight decay and how they differ from dropout.


Topics: Dropout, L2, L1, Ensemble
1. What is dropout? (Easy)
Answer: During training, each activation is kept with probability 1−p and set to zero otherwise—different random mask each step. Reduces co-adaptation of neurons.
2. Dropout at training vs inference. (Easy)
Answer: Training: apply stochastic mask. Inference: no dropout—use full network. Expectation of output must match; handled by scaling (see inverted dropout or test-time multiply by 1−p).
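A minimal PyTorch sketch of the train/eval switch (the tiny model and sizes are just for illustration):

    import torch
    import torch.nn as nn

    # nn.Dropout is stochastic only while the module is in training mode.
    model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
    x = torch.randn(4, 10)

    model.train()              # dropout active: fresh random mask each forward pass
    y_train = model(x)

    model.eval()               # dropout disabled: full network, deterministic output
    with torch.no_grad():
        y_eval = model(x)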
3. What is inverted dropout? (Medium)
Answer: Scale kept activations by 1/(1−p) during training so inference needs no extra scaling. Common in frameworks—cleaner eval path.
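A from-scratch sketch of inverted dropout (the function name and signature are illustrative; framework dropout layers already do this internally):

    import torch

    def inverted_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
        """Keep each activation with probability 1-p and rescale by 1/(1-p)
        so the expected activation matches the no-dropout forward pass."""
        if not training or p == 0.0:
            return x                          # inference path: no mask, no scaling
        mask = (torch.rand_like(x) > p).float()
        return x * mask / (1.0 - p)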
4. Ensemble interpretation of dropout. (Medium)
Answer: Training samples many thinned subnets; inference averages over exponentially many such nets—approximated by using the full net with scaled weights. Explains regularizing effect.
5. Where is dropout usually applied? (Easy)
Answer: After fully connected or sometimes conv layers (less common in modern CNNs); often not on output layer. Transformers use attention dropout on weights/probs.
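One common placement, sketched as a small PyTorch MLP (layer sizes and dropout rates are arbitrary):

    import torch.nn as nn

    mlp = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),    # after a fully connected hidden layer
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(128, 10),   # output layer: typically no dropout here
    )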
6. Typical dropout probability p? (Easy)
Answer: Hidden layers often 0.2–0.5; too high hurts capacity. Tune on validation; some architectures (BN-heavy nets) use less dropout.
7. L2 regularization (weight decay): effect. (Medium)
Answer: The penalty λ||w||² encourages smaller weights, smoother functions, and less overfitting. With plain SGD it is equivalent to multiplicatively shrinking the weights each step; AdamW decouples weight decay from the adaptive gradient update so the decay is applied consistently.
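A sketch of both flavors in PyTorch (the lr and weight_decay values are placeholders, not recommendations):

    import torch
    import torch.nn as nn

    model = nn.Linear(100, 10)

    # L2 penalty folded into the gradient step vs. AdamW's decoupled decay.
    sgd   = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
    adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)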
8. L1 vs L2 for neural nets. (Medium)
Answer: L1 encourages sparsity (many exact zeros with subgradient methods). L2 shrinks all weights smoothly. L2 is default; L1 for feature selection or sparse models.
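A minimal sketch of adding an explicit L1 penalty to the loss (the lambda value and toy model are illustrative):

    import torch
    import torch.nn as nn

    model = nn.Linear(20, 2)
    criterion = nn.CrossEntropyLoss()
    x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))

    l1_lambda = 1e-4
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = criterion(model(x), y) + l1_lambda * l1_penalty
    loss.backward()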
9. Monte Carlo dropout at test time: why? (Hard)
Answer: Leave dropout on during inference, average multiple forward passes—approximate predictive uncertainty (Bayesian NN heuristic).
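A sketch of MC dropout in PyTorch: keep the model in eval mode but flip dropout modules back to training so the masks stay stochastic (the model and pass count are illustrative):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 1))
    x = torch.randn(16, 10)

    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                       # re-enable stochastic masks only in dropout layers

    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(50)])   # 50 stochastic forward passes
    mean, std = samples.mean(dim=0), samples.std(dim=0)        # predictive mean and spread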
10. Dropout with batch normalization: interaction? (Hard)
Answer: Order and strength matter; dropout before BN can shift batch statistics. Many modern vision models rely more on BN + data aug than heavy dropout—know it’s architecture-dependent.
11. Spatial dropout in CNNs. (Medium)
Answer: Drop entire feature maps (channels) instead of individual activations. Because nearby activations within a feature map are strongly correlated, per-pixel dropout does little; dropping whole channels gives stronger structural regularization.
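In PyTorch this is nn.Dropout2d; a minimal sketch (the tensor shape is illustrative):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 32, 28, 28)        # (batch, channels, height, width)
    spatial_drop = nn.Dropout2d(p=0.2)    # zeroes whole channels, not individual pixels

    spatial_drop.train()
    y = spatial_drop(x)                   # some of the 32 feature maps are zeroed entirely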
12. Is label smoothing regularization? (Medium)
Answer: Yes—softens targets so the model doesn’t become overconfident; acts on the loss, not weights directly.
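A minimal sketch using the label_smoothing argument of PyTorch's cross-entropy loss (the 0.1 value is illustrative):

    import torch
    import torch.nn as nn

    # With smoothing eps and K classes, the true class target becomes 1 - eps + eps/K
    # and every other class gets eps/K, so the model is never pushed to full confidence.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    logits = torch.randn(4, 10)
    targets = torch.randint(0, 10, (4,))
    loss = criterion(logits, targets)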
13. Gaussian noise on inputs as regularization. (Easy)
Answer: Adds robustness to small input perturbations—related to data augmentation and Tikhonov-style effects in linear models.
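A tiny sketch of input-noise injection during training (the sigma value and function name are illustrative):

    import torch

    def add_input_noise(x: torch.Tensor, sigma: float = 0.1, training: bool = True) -> torch.Tensor:
        """Add zero-mean Gaussian noise to inputs during training only."""
        if not training:
            return x
        return x + sigma * torch.randn_like(x)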
14. Stochastic depth / drop path (high level). (Hard)
Answer: Randomly skip whole residual branches during training—regularizes very deep networks similarly in spirit to dropout but on graph structure.
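A sketch of a per-sample drop-path mask, assuming it is applied to a residual branch inside a block (the function name and usage line are illustrative):

    import torch

    def drop_path(branch_out: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
        """With probability drop_prob, zero the residual branch for a sample;
        kept branches are rescaled so the expected output is unchanged."""
        if not training or drop_prob == 0.0:
            return branch_out
        keep_prob = 1.0 - drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        shape = (branch_out.shape[0],) + (1,) * (branch_out.dim() - 1)
        mask = (torch.rand(shape, device=branch_out.device) < keep_prob).to(branch_out.dtype)
        return branch_out * mask / keep_prob

    # Hypothetical use inside a residual block:
    # x = x + drop_path(self.branch(x), drop_prob=0.1, training=self.training)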
15. When to prefer dropout vs weight decay? (Medium)
Answer: Often both are used lightly. Dropout targets co-adaptation of activations; weight decay shrinks parameters. Large-data, BN-heavy setups may need little dropout; small-data fully connected nets benefit more.
State clearly: dropout is off at eval unless you are doing MC dropout.

Quick review checklist

  • Dropout train vs eval; inverted dropout; ensemble view.
  • L1 vs L2; AdamW's decoupled weight decay; spatial dropout.
  • MC dropout; interaction with BN at high level.