Neural Networks: 15 Essential Q&A
Interview Prep

Batch Normalization — 15 Interview Questions

Normalize activations per channel, learnable γ and β, train vs eval behavior, and why batch size matters—plus Layer Norm and Group Norm as contrasts.

1 What does batch normalization do per feature map? (Easy)
Answer: For each channel (or neuron), subtract batch mean and divide by batch std (with ε), then apply learnable scale γ and shift β so the network can undo normalization if useful.
x̂ = (x − μ_B) / √(σ²_B + ε),   y = γ · x̂ + β
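A minimal NumPy sketch of that per-feature computation in training mode (shapes and names here are illustrative, not from any particular framework):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (N, D) activations; gamma, beta: (D,) learnable scale and shift."""
    mu = x.mean(axis=0)                    # batch mean per feature
    var = x.var(axis=0)                    # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

x = np.random.randn(32, 4)                 # a batch of 32 examples, 4 features
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```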
2 What is “internal covariate shift”? (Medium)
Answer: The original paper’s term: changing distribution of layer inputs as earlier layers update. BN aims to stabilize those distributions. Modern view also stresses smoothing loss landscape and allowing higher learning rates.
3 Training vs inference (eval) in batch norm. (Easy)
Answer: Train: normalize with the current mini-batch μ and σ², and update running_mean and running_var with momentum. Eval: use the stored running statistics instead of batch stats, so a single sample or any batch size behaves consistently; γ and β are simply applied as learned.
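A small PyTorch sketch of the mode switch (assuming the standard torch.nn.BatchNorm1d behavior):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)     # tracks running_mean / running_var by default
x = torch.randn(32, 4)

bn.train()                 # training mode: normalize with batch stats, update running stats
_ = bn(x)
print(bn.running_mean)     # has moved toward the batch mean

bn.eval()                  # eval mode: normalize with the stored running stats
_ = bn(torch.randn(1, 4))  # a single example is fine here
```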
4 Why is batch norm problematic with batch size 1 or very small batches? (Medium)
Answer: Batch statistics are noisy (and the variance of a single sample is zero), so normalization degenerates and the running estimates drift from the true activation statistics. Common fixes: Sync BN across GPUs, a larger batch, or switching to Layer/Group Norm.
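A toy NumPy illustration of the degenerate case (illustrative only): with one example, the batch variance is zero and the normalized output collapses.

```python
import numpy as np

x = np.random.randn(1, 4)                  # "batch" of a single example
mu, var = x.mean(axis=0), x.var(axis=0)    # var is exactly 0 per feature
x_hat = (x - mu) / np.sqrt(var + 1e-5)
print(x_hat)                               # all zeros: the signal is gone
```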
5 Where is BN typically placed: before or after activation? (Medium)
Answer: The original paper puts it before the nonlinearity (conv → BN → ReLU), and ResNet-style blocks follow the same conv–BN–ReLU order. Some practitioners place BN after the activation and report comparable results. Be consistent within an architecture, and know both camps exist.
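A typical conv → BN → ReLU block in PyTorch (one common pattern, not the only valid ordering):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),  # conv bias is redundant: BN's β shifts anyway
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```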
6 Are γ and β learned? (Easy)
Answer: Yes, they are trainable parameters just like weights. If the unnormalized distribution is actually what the network needs, it can learn γ and β that undo the normalization, so BN does not reduce representational power.
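In PyTorch they show up as bn.weight (γ) and bn.bias (β) and are updated by the optimizer like any other parameter:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(64)
print(bn.weight.shape, bn.weight.requires_grad)  # torch.Size([64]) True  -> γ
print(bn.bias.shape, bn.bias.requires_grad)      # torch.Size([64]) True  -> β
```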
7 What does momentum on running mean/variance mean? (Medium)
Answer: An exponential moving average of batch statistics: running = m·running + (1−m)·batch with decay m near 1 (some frameworks flip the naming, e.g. PyTorch's momentum=0.1 weights the new batch statistic). This smooths the estimates across iterations so eval stats are not dominated by the last batch.
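A sketch of the update, written in the PyTorch convention where momentum weights the new batch statistic (the function name is illustrative):

```python
def update_running(running, batch_stat, momentum=0.1):
    # Small momentum -> slow-moving running estimate; at eval time only `running` is used.
    return (1.0 - momentum) * running + momentum * batch_stat
```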
8 Batch norm in CNNs: which dimensions are normalized? (Medium)
Answer: Per channel, aggregate over batch, height, width (N×H×W) to get one μ and σ per channel. Keeps spatial structure within each feature map.
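Equivalent NumPy reduction for a (N, C, H, W) feature map (shapes are illustrative):

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)           # (N, C, H, W)
mu = x.mean(axis=(0, 2, 3), keepdims=True)   # one mean per channel -> shape (1, 16, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)   # one variance per channel
x_hat = (x - mu) / np.sqrt(var + 1e-5)
```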
9 Layer Norm vs Batch Norm: key difference. (Medium)
Answer: LN normalizes across the features of each example, independent of batch size; BN normalizes each feature across the batch dimension. That makes LN a better fit for RNNs/Transformers and small batches.
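A quick PyTorch contrast of the normalization axes (standard nn.LayerNorm / nn.BatchNorm1d behavior):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8)        # even a tiny batch is fine for LN
ln = nn.LayerNorm(8)         # stats per example (per row)
bn = nn.BatchNorm1d(8)       # stats per feature (per column, across the batch)
print(ln(x).mean(dim=1))     # ~0 for each example
print(bn(x).mean(dim=0))     # ~0 for each feature (training mode)
```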
10 Group Norm: in one sentence. (Hard)
Answer: Split the channels into groups and, for each example, normalize over each group's channels and spatial positions, which removes the batch dependence and works well for small batches in vision.
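Usage sketch with PyTorch's nn.GroupNorm (the group count here is just an example):

```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=8, num_channels=64)  # 64 channels split into 8 groups
x = torch.randn(1, 64, 16, 16)                    # batch size 1 is fine: stats are per example
out = gn(x)
```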
11 Does BN act like regularization? (Medium)
Answer: Noisy batch statistics add mild regularization similar to jitter; don’t rely on it instead of dropout/weight decay. Effect shrinks with large batch.
12 Inference batch size different from training: is that OK? (Easy)
Answer: Yes—eval uses fixed running stats, so any batch size (including 1) is valid if running stats were estimated well during training.
13 Fine-tuning: when should you freeze BN layers? (Hard)
Answer: Small new dataset: sometimes freeze BN (use pretrained running stats) to avoid bad estimates; or keep BN trainable with small LR. Depends on domain shift and batch size.
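One common freezing recipe, sketched in PyTorch (an assumption about your setup, not the only valid approach; the helper name is made up):

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module):
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                      # keep using the pretrained running stats
            for p in m.parameters():
                p.requires_grad = False   # stop updating γ and β

# Caveat: a later model.train() call flips BN back to train mode, so re-apply this after it.
```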
14 Interaction with weight decay (L2). (Hard)
Answer: Because BN makes the layer's output invariant to the scale of the preceding weights, an L2 penalty on those weights acts mostly as an implicit learning-rate control (shrinking the weight norm raises the effective step size) rather than a direct capacity constraint; decoupled weight decay (AdamW) is a related but separate implementation detail. Use framework defaults and cite the literature for the optimizer pairing you discuss.
15 When would you avoid batch norm? (Medium)
Answer: Very small batches, non-batch settings (online), some GAN setups, or when you need batch-independent norm—prefer LN, GN, or modern alternatives (RMSNorm, etc.).
Always state the train vs eval distinction and mention running statistics; that is the core of any BN interview answer.

Quick review checklist

  • Formula; γ, β; train vs eval; running averages.
  • Batch size issues; conv BN axes; LN vs BN.
  • Regularization side effect; when to use GN/LN.