Neural Networks: 15 Essential Q&A
Interview Prep

Batch Normalization — 15 Interview Questions

Normalize activations per channel, learnable γ and β, train vs eval behavior, and why batch size matters—plus Layer Norm and Group Norm as contrasts.

1 What does batch normalization do per feature map? (Easy)
Answer: For each channel (or neuron), subtract batch mean and divide by batch std (with ε), then apply learnable scale γ and shift β so the network can undo normalization if useful.
x̂ = (x − μ_B) / √(σ²_B + ε),   y = γ · x̂ + β
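A minimal NumPy sketch of that per-feature computation in training mode (shapes and names here are illustrative, not from any particular framework):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (N, D) activations; gamma, beta: (D,) learnable scale and shift."""
    mu = x.mean(axis=0)                    # batch mean per feature
    var = x.var(axis=0)                    # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

x = np.random.randn(32, 4)                 # a batch of 32 examples, 4 features
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```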
2 What is “internal covariate shift”? (Medium)
Answer: The original paper’s term: changing distribution of layer inputs as earlier layers update. BN aims to stabilize those distributions. Modern view also stresses smoothing loss landscape and allowing higher learning rates.
3 Training vs inference (eval) in batch norm. (Easy)
Answer: Train: normalize with the current mini-batch μ and σ², and update running_mean and running_var with momentum. Eval: use the stored running statistics instead of batch stats, so a single sample or any batch size behaves consistently; γ and β are simply applied as learned.
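A small PyTorch sketch of the mode switch (assuming the standard torch.nn.BatchNorm1d behavior):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)     # tracks running_mean / running_var by default
x = torch.randn(32, 4)

bn.train()                 # training mode: normalize with batch stats, update running stats
_ = bn(x)
print(bn.running_mean)     # has moved toward the batch mean

bn.eval()                  # eval mode: normalize with the stored running stats
_ = bn(torch.randn(1, 4))  # a single example is fine here
```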
4 Why is batch norm problematic with batch size 1 or very small batches? (Medium)
Answer: Batch statistics are noisy (and the variance of a single sample is zero), so normalization degenerates and the running estimates drift from the true activation statistics. Common fixes: Sync BN across GPUs, a larger batch, or switching to Layer/Group Norm.
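A toy NumPy illustration of the degenerate case (illustrative only): with one example, the batch variance is zero and the normalized output collapses.

```python
import numpy as np

x = np.random.randn(1, 4)                  # "batch" of a single example
mu, var = x.mean(axis=0), x.var(axis=0)    # var is exactly 0 per feature
x_hat = (x - mu) / np.sqrt(var + 1e-5)
print(x_hat)                               # all zeros: the signal is gone
```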
5 Where is BN typically placed: before or after activation? (Medium)
Answer: The original paper puts it before the nonlinearity (conv → BN → ReLU), and ResNet-style blocks follow the same conv–BN–ReLU order. Some practitioners place BN after the activation and report comparable results. Be consistent within an architecture, and know both camps exist.
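A typical conv → BN → ReLU block in PyTorch (one common pattern, not the only valid ordering):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),  # conv bias is redundant: BN's β shifts anyway
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```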
6 Are γ and β learned? (Easy)
Answer: Yes, they are trainable parameters just like weights. If the unnormalized distribution is actually what the network needs, it can learn γ and β that undo the normalization, so BN does not reduce representational power.
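In PyTorch they show up as bn.weight (γ) and bn.bias (β) and are updated by the optimizer like any other parameter:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(64)
print(bn.weight.shape, bn.weight.requires_grad)  # torch.Size([64]) True  -> γ
print(bn.bias.shape, bn.bias.requires_grad)      # torch.Size([64]) True  -> β
```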
7 What does momentum on running mean/variance mean? (Medium)
Answer: An exponential moving average of batch statistics: running = m·running + (1−m)·batch with decay m near 1 (some frameworks flip the naming, e.g. PyTorch's momentum=0.1 weights the new batch statistic). This smooths the estimates across iterations so eval stats are not dominated by the last batch.
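A sketch of the update, written in the PyTorch convention where momentum weights the new batch statistic (the function name is illustrative):

```python
def update_running(running, batch_stat, momentum=0.1):
    # Small momentum -> slow-moving running estimate; at eval time only `running` is used.
    return (1.0 - momentum) * running + momentum * batch_stat
```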
8 Batch norm in CNNs: which dimensions are normalized? (Medium)
Answer: Per channel, aggregate over batch, height, width (N×H×W) to get one μ and σ per channel. Keeps spatial structure within each feature map.
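Equivalent NumPy reduction for a (N, C, H, W) feature map (shapes are illustrative):

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)           # (N, C, H, W)
mu = x.mean(axis=(0, 2, 3), keepdims=True)   # one mean per channel -> shape (1, 16, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)   # one variance per channel
x_hat = (x - mu) / np.sqrt(var + 1e-5)
```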
9 Layer Norm vs Batch Norm: key difference. (Medium)
Answer: LN normalizes across the features of each example, independent of batch size; BN normalizes each feature across the batch dimension. That makes LN a better fit for RNNs/Transformers and small batches.
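A quick PyTorch contrast of the normalization axes (standard nn.LayerNorm / nn.BatchNorm1d behavior):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8)        # even a tiny batch is fine for LN
ln = nn.LayerNorm(8)         # stats per example (per row)
bn = nn.BatchNorm1d(8)       # stats per feature (per column, across the batch)
print(ln(x).mean(dim=1))     # ~0 for each example
print(bn(x).mean(dim=0))     # ~0 for each feature (training mode)
```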
10 Group Norm: in one sentence. (Hard)
Answer: Split the channels into groups and, for each example, normalize over each group's channels and spatial positions, which removes the batch dependence and works well for small batches in vision.
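Usage sketch with PyTorch's nn.GroupNorm (the group count here is just an example):

```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=8, num_channels=64)  # 64 channels split into 8 groups
x = torch.randn(1, 64, 16, 16)                    # batch size 1 is fine: stats are per example
out = gn(x)
```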
11 Does BN act like regularization? (Medium)
Answer: Noisy batch statistics add mild regularization similar to jitter; don’t rely on it instead of dropout/weight decay. Effect shrinks with large batch.
12 Inference batch size different from training: is that OK? (Easy)
Answer: Yes—eval uses fixed running stats, so any batch size (including 1) is valid if running stats were estimated well during training.
13 Fine-tuning: when should you freeze BN layers? (Hard)
Answer: Small new dataset: sometimes freeze BN (use pretrained running stats) to avoid bad estimates; or keep BN trainable with small LR. Depends on domain shift and batch size.
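One common freezing recipe, sketched in PyTorch (an assumption about your setup, not the only valid approach; the helper name is made up):

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module):
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                      # keep using the pretrained running stats
            for p in m.parameters():
                p.requires_grad = False   # stop updating γ and β

# Caveat: a later model.train() call flips BN back to train mode, so re-apply this after it.
```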
14 Interaction with weight decay (L2). (Hard)
Answer: Because BN makes the layer's output invariant to the scale of the preceding weights, an L2 penalty on those weights acts mostly as an implicit learning-rate control (shrinking the weight norm raises the effective step size) rather than a direct capacity constraint; decoupled weight decay (AdamW) is a related but separate implementation detail. Use framework defaults and cite the literature for the optimizer pairing you discuss.
15 When would you avoid batch norm? (Medium)
Answer: Very small batches, non-batch settings (online), some GAN setups, or when you need batch-independent norm—prefer LN, GN, or modern alternatives (RMSNorm, etc.).
Always state the train vs eval distinction and mention running statistics; that is the core of any BN interview answer.

Quick review checklist

  • Formula; γ, β; train vs eval; running averages.
  • Batch size issues; conv BN axes; LN vs BN.
  • Regularization side effect; when to use GN/LN.