Batch Normalization — 15 Interview Questions
Normalize activations per channel, learnable γ and β, train vs eval behavior, and why batch size matters—plus Layer Norm and Group Norm as contrasts.
1. What does batch normalization do per feature map? (Easy)
Answer: For each channel (or neuron), subtract batch mean and divide by batch std (with ε), then apply learnable scale γ and shift β so the network can undo normalization if useful.
y = γ · (x − μ_B) / √(σ²_B + ε) + β
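A minimal sketch (assuming PyTorch; the variable names are illustrative) applying this formula per channel of an NCHW tensor and checking it against nn.BatchNorm2d in training mode:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 4, 4)                        # (N, C, H, W)
mu = x.mean(dim=(0, 2, 3), keepdim=True)           # one mean per channel
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
eps = 1e-5
gamma = torch.ones(1, 3, 1, 1)                     # learnable scale, init 1
beta = torch.zeros(1, 3, 1, 1)                     # learnable shift, init 0
y_manual = gamma * (x - mu) / torch.sqrt(var + eps) + beta

bn = nn.BatchNorm2d(3, eps=eps)                    # fresh BN: gamma=1, beta=0
bn.train()                                         # training mode: batch stats
y_torch = bn(x)
print(torch.allclose(y_manual, y_torch, atol=1e-5))  # True
```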
2. What is “internal covariate shift”? (Medium)
Answer: The original paper’s term for the changing distribution of layer inputs as earlier layers update; BN aims to stabilize those distributions. The modern view also stresses smoothing of the loss landscape and the ability to use higher learning rates.
3. Training vs inference (eval) in batch norm. (Easy)
Answer: Train: normalize with the current mini-batch μ and σ², and update running_mean and running_var with momentum. Eval: use the stored running statistics—not batch stats—so a single sample or any batch size behaves consistently.
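A short sketch of the train/eval difference, assuming PyTorch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(16, 4)

bn.train()
y_train = bn(x)          # normalized with this batch's mu and var;
                         # running stats are updated as a side effect

bn.eval()
y_eval = bn(x)           # normalized with the stored running stats instead
print(torch.allclose(y_train, y_eval))   # False in general
print(bn.running_mean, bn.running_var)   # updated by the train() pass
```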
4. Why is batch norm problematic with batch size 1 or very small batches? (Medium)
Answer: Batch statistics are noisy or undefined (a single sample has zero variance per channel), and the running estimates drift from what the network actually saw. Common fixes: SyncBN across GPUs, a larger batch, or switching to Layer/Group Norm.
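A sketch of the failure mode, assuming PyTorch: BatchNorm1d cannot even compute batch statistics from a single example in training mode, while GroupNorm is batch-size independent:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8)                  # batch size 1
try:
    nn.BatchNorm1d(8).train()(x)
except ValueError as e:
    print("BN fails:", e)              # "Expected more than 1 value per channel..."

gn = nn.GroupNorm(num_groups=4, num_channels=8)
print(gn(x).shape)                     # works fine: torch.Size([1, 8])
```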
5. Where is BN typically placed: before or after activation? (Medium)
Answer: The original paper puts BN before the nonlinearity (conv → BN → ReLU), and ResNet-style conv–BN–ReLU is still the most common ordering; some architectures place BN after the activation instead. Be consistent within an architecture, and know both camps exist.
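A minimal conv → BN → ReLU block in PyTorch as one illustration of this ordering; bias=False is conventional because BN’s β makes a conv bias redundant:

```python
import torch.nn as nn

def conv_bn_relu(c_in: int, c_out: int) -> nn.Sequential:
    """Conv -> BN -> ReLU, the ResNet-style ordering."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )
```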
6. Are γ and β learned? (Easy)
Answer: Yes—trainable parameters like weights. If normalization is unhelpful for a layer, the network can learn γ and β that undo it, so BN doesn’t hurt representational power.
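A quick way to see this in PyTorch, where γ and β appear as the BN module’s weight and bias parameters:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(3)
for name, p in bn.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# weight (3,) True   <- gamma, initialized to 1
# bias   (3,) True   <- beta, initialized to 0
# running_mean / running_var are buffers, not parameters
```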
7. What does momentum on running mean/variance mean? (Medium)
Answer: An exponential moving average of batch statistics that smooths estimates across iterations, so eval stats aren’t dominated by the last batch. Conventions differ: PyTorch uses running = (1 − m)·running + m·batch with m = 0.1 by default, while some frameworks treat m as the decay on the running value (running = m·running + (1 − m)·batch).
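A sketch of the update under PyTorch’s convention, where momentum weights the new batch statistic:

```python
import torch

momentum = 0.1
running_mean = torch.zeros(4)
for _ in range(100):
    batch = torch.randn(32, 4) + 5.0             # true mean is 5
    batch_mean = batch.mean(dim=0)
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
print(running_mean)                              # converges toward ~5
```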
8. Batch norm in CNNs—which dimensions are normalized? (Medium)
Answer: Per channel: aggregate over batch, height, and width (N×H×W) to get one μ and one σ per channel. This keeps spatial structure within each feature map.
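The reduction axes, sketched in PyTorch for an NCHW tensor:

```python
import torch

x = torch.randn(8, 16, 32, 32)                   # (N, C, H, W)
mu = x.mean(dim=(0, 2, 3))                       # reduce N, H, W -> shape (16,)
var = x.var(dim=(0, 2, 3), unbiased=False)       # one statistic per channel
print(mu.shape, var.shape)                       # torch.Size([16]) twice
```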
9. Layer Norm vs Batch Norm—key difference. (Medium)
Answer: LN normalizes across the features of each example (independent of batch size); BN normalizes across the batch dimension. LN fits RNNs/Transformers and small batches better.
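A sketch contrasting the reduction axes (PyTorch, illustrative shapes): BN reduces over the batch, LN over the features of each single example:

```python
import torch

x = torch.randn(8, 16)             # (batch, features)
bn_mu = x.mean(dim=0)              # BN: per feature, across the batch -> (16,)
ln_mu = x.mean(dim=1)              # LN: per example, across features  -> (8,)
print(bn_mu.shape, ln_mu.shape)
```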
10. Group Norm—in one sentence. (Hard)
Answer: Split channels into groups and, for each example, normalize over the channels in a group together with all spatial locations—batch-independent like LN, and useful for small batches in vision.
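A minimal GroupNorm usage sketch in PyTorch:

```python
import torch
import torch.nn as nn

# 32 channels split into 8 groups of 4; each group is normalized over
# (channels-in-group, H, W) per example, so batch size never matters.
gn = nn.GroupNorm(num_groups=8, num_channels=32)
x = torch.randn(2, 32, 16, 16)       # works identically for batch size 1
print(gn(x).shape)                   # torch.Size([2, 32, 16, 16])
```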
11. Does BN act like regularization? (Medium)
Answer: Mildly—noisy batch statistics inject noise much like data jitter, but don’t rely on this in place of dropout or weight decay. The effect shrinks as batch size grows.
12. Inference batch size different from training—OK? (Easy)
Answer: Yes—eval uses fixed running stats, so any batch size (including 1) is valid if running stats were estimated well during training.
13. Fine-tuning: freeze BN layers—when? (Hard)
Answer: With a small new dataset, freezing BN (keeping the pretrained running stats, and optionally γ and β) avoids corrupting statistics with noisy fine-tuning batches; alternatively, keep BN trainable with a small learning rate. The right choice depends on domain shift and batch size; one freezing recipe is sketched below.
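One common freezing recipe, sketched in PyTorch (an illustration, not the only approach; freeze_bn is a hypothetical helper):

```python
import torch.nn as nn

def freeze_bn(model: nn.Module) -> None:
    """Put all BN layers in eval mode and stop gamma/beta updates."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.eval()                      # use running stats, don't update them
            for p in m.parameters():      # freeze gamma and beta
                p.requires_grad = False
```

Note that model.train() flips BN modules back to training mode, so a recipe like this is typically re-applied after every train() call.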
14. Interaction with weight decay (L2). (Hard)
Answer: BN makes a layer’s output invariant to the scale of the preceding weights, so L2 on those weights doesn’t constrain the function directly—it mainly changes the effective step geometry (shrinking weight norms raise the effective learning rate). Implementation details are debated (e.g., decoupled weight decay in AdamW), and many codebases exclude γ, β, and biases from weight decay. Use framework defaults and the literature for the optimizer pairing you cite.
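A common pattern, sketched in PyTorch (illustrative, not prescribed by any framework): exclude 1-D parameters, i.e. BN’s γ/β and biases, from weight decay via optimizer parameter groups:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, bias=False), nn.BatchNorm2d(8), nn.ReLU())

decay, no_decay = [], []
for name, p in model.named_parameters():
    # 1-D parameters are BN gamma/beta and biases; leave them undecayed
    (no_decay if p.ndim <= 1 else decay).append(p)

opt = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 1e-2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)
```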
15. When would you avoid batch norm? (Medium)
Answer: Very small batches, non-batch settings (online), some GAN setups, or when you need batch-independent norm—prefer LN, GN, or modern alternatives (RMSNorm, etc.).
Always state the train vs eval distinction and mention running statistics—that is the core BN interview answer.
Quick review checklist
- Formula; γ, β; train vs eval; running averages.
- Batch size issues; conv BN axes; LN vs BN.
- Regularization side effect; when to use GN/LN.