
Dropout & Regularization — 15 Interview Questions

Random neuron masks, inverted dropout, train vs eval, plus L1/L2 weight decay and how they differ from dropout.


Topics: Dropout, L2, L1, Ensemble
1. What is dropout? (Easy)
Answer: During training, each activation is kept with probability 1−p and set to zero otherwise—different random mask each step. Reduces co-adaptation of neurons.
2. Dropout at training vs inference. (Easy)
Answer: Training: apply stochastic mask. Inference: no dropout—use full network. Expectation of output must match; handled by scaling (see inverted dropout or test-time multiply by 1−p).
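A minimal PyTorch sketch of the train/eval switch (the tiny model and sizes are just for illustration):

    import torch
    import torch.nn as nn

    # nn.Dropout is stochastic only while the module is in training mode.
    model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
    x = torch.randn(4, 10)

    model.train()              # dropout active: fresh random mask each forward pass
    y_train = model(x)

    model.eval()               # dropout disabled: full network, deterministic output
    with torch.no_grad():
        y_eval = model(x)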
3. What is inverted dropout? (Medium)
Answer: Scale kept activations by 1/(1−p) during training so inference needs no extra scaling. Common in frameworks—cleaner eval path.
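A from-scratch sketch of inverted dropout (the function name and signature are illustrative; framework dropout layers already do this internally):

    import torch

    def inverted_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
        """Keep each activation with probability 1-p and rescale by 1/(1-p)
        so the expected activation matches the no-dropout forward pass."""
        if not training or p == 0.0:
            return x                          # inference path: no mask, no scaling
        mask = (torch.rand_like(x) > p).float()
        return x * mask / (1.0 - p)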
4. Ensemble interpretation of dropout. (Medium)
Answer: Training samples many thinned subnets; inference averages over exponentially many such nets—approximated by using the full net with scaled weights. Explains regularizing effect.
5. Where is dropout usually applied? (Easy)
Answer: After fully connected or sometimes conv layers (less common in modern CNNs); often not on output layer. Transformers use attention dropout on weights/probs.
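One common placement, sketched as a small PyTorch MLP (layer sizes and dropout rates are arbitrary):

    import torch.nn as nn

    mlp = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),    # after a fully connected hidden layer
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(128, 10),   # output layer: typically no dropout here
    )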
6. Typical dropout probability p? (Easy)
Answer: Hidden layers often 0.2–0.5; too high hurts capacity. Tune on validation; some architectures (BN-heavy nets) use less dropout.
7. L2 regularization (weight decay): effect. (Medium)
Answer: The penalty λ||w||² encourages smaller weights, smoother functions, and less overfitting. With plain SGD it is equivalent to multiplicatively shrinking the weights each step; AdamW decouples weight decay from the adaptive gradient update so the decay is applied consistently.
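A sketch of both flavors in PyTorch (the lr and weight_decay values are placeholders, not recommendations):

    import torch
    import torch.nn as nn

    model = nn.Linear(100, 10)

    # L2 penalty folded into the gradient step vs. AdamW's decoupled decay.
    sgd   = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
    adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)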
8. L1 vs L2 for neural nets. (Medium)
Answer: L1 encourages sparsity (many exact zeros with subgradient methods). L2 shrinks all weights smoothly. L2 is default; L1 for feature selection or sparse models.
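A minimal sketch of adding an explicit L1 penalty to the loss (the lambda value and toy model are illustrative):

    import torch
    import torch.nn as nn

    model = nn.Linear(20, 2)
    criterion = nn.CrossEntropyLoss()
    x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))

    l1_lambda = 1e-4
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = criterion(model(x), y) + l1_lambda * l1_penalty
    loss.backward()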
9. Monte Carlo dropout at test time: why? (Hard)
Answer: Leave dropout on during inference, average multiple forward passes—approximate predictive uncertainty (Bayesian NN heuristic).
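A sketch of MC dropout in PyTorch: keep the model in eval mode but flip dropout modules back to training so the masks stay stochastic (the model and pass count are illustrative):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 1))
    x = torch.randn(16, 10)

    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                       # re-enable stochastic masks only in dropout layers

    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(50)])   # 50 stochastic forward passes
    mean, std = samples.mean(dim=0), samples.std(dim=0)        # predictive mean and spread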
10. Dropout with batch normalization: interaction? (Hard)
Answer: Order and strength matter; dropout before BN can shift batch statistics. Many modern vision models rely more on BN + data aug than heavy dropout—know it’s architecture-dependent.
11. Spatial dropout in CNNs. (Medium)
Answer: Drop entire feature maps (channels) instead of individual activations. Because nearby activations within a feature map are strongly correlated, per-pixel dropout does little; dropping whole channels gives stronger structural regularization.
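In PyTorch this is nn.Dropout2d; a minimal sketch (the tensor shape is illustrative):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 32, 28, 28)        # (batch, channels, height, width)
    spatial_drop = nn.Dropout2d(p=0.2)    # zeroes whole channels, not individual pixels

    spatial_drop.train()
    y = spatial_drop(x)                   # some of the 32 feature maps are zeroed entirely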
12. Is label smoothing regularization? (Medium)
Answer: Yes—softens targets so the model doesn’t become overconfident; acts on the loss, not weights directly.
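A minimal sketch using the label_smoothing argument of PyTorch's cross-entropy loss (the 0.1 value is illustrative):

    import torch
    import torch.nn as nn

    # With smoothing eps and K classes, the true class target becomes 1 - eps + eps/K
    # and every other class gets eps/K, so the model is never pushed to full confidence.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    logits = torch.randn(4, 10)
    targets = torch.randint(0, 10, (4,))
    loss = criterion(logits, targets)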
13. Gaussian noise on inputs as regularization. (Easy)
Answer: Adds robustness to small input perturbations—related to data augmentation and Tikhonov-style effects in linear models.
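A tiny sketch of input-noise injection during training (the sigma value and function name are illustrative):

    import torch

    def add_input_noise(x: torch.Tensor, sigma: float = 0.1, training: bool = True) -> torch.Tensor:
        """Add zero-mean Gaussian noise to inputs during training only."""
        if not training:
            return x
        return x + sigma * torch.randn_like(x)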
14. Stochastic depth / drop path (high level). (Hard)
Answer: Randomly skip whole residual branches during training—regularizes very deep networks similarly in spirit to dropout but on graph structure.
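A sketch of a per-sample drop-path mask, assuming it is applied to a residual branch inside a block (the function name and usage line are illustrative):

    import torch

    def drop_path(branch_out: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
        """With probability drop_prob, zero the residual branch for a sample;
        kept branches are rescaled so the expected output is unchanged."""
        if not training or drop_prob == 0.0:
            return branch_out
        keep_prob = 1.0 - drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        shape = (branch_out.shape[0],) + (1,) * (branch_out.dim() - 1)
        mask = (torch.rand(shape, device=branch_out.device) < keep_prob).to(branch_out.dtype)
        return branch_out * mask / keep_prob

    # Hypothetical use inside a residual block:
    # x = x + drop_path(self.branch(x), drop_prob=0.1, training=self.training)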
15. When to prefer dropout vs weight decay? (Medium)
Answer: Often both are used lightly. Dropout targets co-adaptation of activations; weight decay shrinks parameters. Large-data, BN-heavy setups may need little dropout; small-data fully connected nets benefit more.
State clearly: dropout is off at eval unless you are doing MC dropout.

Quick review checklist

  • Dropout train vs eval; inverted dropout; ensemble view.
  • L1 vs L2; AdamW's decoupled weight decay; spatial dropout.
  • MC dropout; interaction with BN at high level.