Neural Networks: 15 Essential Q&A
Interview Prep

Loss Functions — 15 Interview Questions

Empirical risk, MSE vs cross-entropy, softmax pairing, robust losses, and how regularizers enter the objective—what interviewers expect you to connect to gradients.


1. What is a loss function in supervised learning? (Easy)
Answer: A scalar that scores how far model outputs are from targets for one example (or batch). Training minimizes the average loss over the dataset—the empirical risk.
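A minimal sketch of the "average per-example loss" idea, using PyTorch (the framework is an assumption for illustration, not specified above):

```python
import torch

# Per-example squared-error scores, then the dataset average (empirical risk)
preds   = torch.tensor([1.0, 2.0, 0.0])
targets = torch.tensor([1.5, 1.0, 0.0])

per_example = (preds - targets) ** 2   # one scalar score per example
risk = per_example.mean()              # training minimizes this average
print(risk)  # tensor(0.4167)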
2. Mean squared error (MSE)—definition and typical use. (Easy)
Answer: Average of squared differences between prediction and target. Common for regression; penalizes large errors heavily. With Gaussian noise assumptions, MSE relates to maximum likelihood.
MSE = (1/n) Σ (ŷ_i − y_i)²
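A quick check of the formula against the built-in, again assuming PyTorch:

```python
import torch
import torch.nn.functional as F

y_hat = torch.tensor([2.5, 0.0, 2.0])
y     = torch.tensor([3.0, -0.5, 2.0])

manual  = ((y_hat - y) ** 2).mean()   # (1/n) * sum of squared residuals
builtin = F.mse_loss(y_hat, y)
print(manual.item(), builtin.item())  # both 0.1667
```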
3. Binary cross-entropy in one line. (Easy)
Answer: For label y ∈ {0,1} and predicted probability p̂, loss encourages p̂ → y. It is the negative log-likelihood of a Bernoulli model—strong gradients when the model is confidently wrong.
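The one-liner is −[y log p̂ + (1−y) log(1−p̂)]; a PyTorch sketch (framework assumed) confirming it matches the library call:

```python
import torch
import torch.nn.functional as F

y     = torch.tensor([1.0, 0.0, 1.0])
p_hat = torch.tensor([0.9, 0.2, 0.4])   # already probabilities

# Manual Bernoulli negative log-likelihood, averaged over examples
manual  = -(y * p_hat.log() + (1 - y) * (1 - p_hat).log()).mean()
builtin = F.binary_cross_entropy(p_hat, y)
print(manual.item(), builtin.item())    # identical values
```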
4. Multi-class cross-entropy with one-hot targets. (Medium)
Answer: −Σ_k y_k log p̂_k with one-hot y picks the log-probability of the true class. With softmax outputs, this is standard classification training.
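A sketch (PyTorch assumed) showing that with one-hot y the sum collapses to the true class's log-probability:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])              # index of the true class

# -sum_k y_k log p_hat_k reduces to -log p_hat_true for one-hot y
log_p   = F.log_softmax(logits, dim=1)
manual  = -log_p[0, target.item()]
builtin = F.cross_entropy(logits, target)
print(manual.item(), builtin.item())    # same number
```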
5. Why softmax + cross-entropy together? (Medium)
Answer: Softmax turns logits into a distribution; CE matches it to labels. For one-hot targets the combined gradient w.r.t. the logits is simply p̂ − y (prediction minus target), which is stable and efficient to implement (e.g. log-softmax + NLL).
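This is the gradient story interviewers want; it can be verified with autograd (PyTorch assumed):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
target = torch.tensor(1)

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

p_hat = F.softmax(logits.detach(), dim=0)
y = F.one_hot(target, num_classes=3).float()
print(logits.grad)   # matches p_hat - y exactly
print(p_hat - y)
```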
6. Hinge loss—when does it appear? (Medium)
Answer: Classic for SVMs: penalizes margin violations. Less common in standard deep classifiers than CE but shows up in contrastive / max-margin formulations.
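A binary-margin sketch with labels in {−1, +1} (PyTorch assumed; `hinge_loss` is an illustrative name, not a library function):

```python
import torch

def hinge_loss(scores, y):
    """Binary hinge: mean(max(0, 1 - y * f(x))), y in {-1, +1}."""
    return torch.clamp(1 - y * scores, min=0).mean()

scores = torch.tensor([0.8, -0.3, 2.0])
y = torch.tensor([1.0, -1.0, 1.0])
print(hinge_loss(scores, y))  # (0.2 + 0.7 + 0.0) / 3 = 0.3
```

Note the zero loss on the third example: points beyond the margin contribute no gradient, unlike CE.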
7. Huber loss vs MSE for regression. (Medium)
Answer: Quadratic like MSE near zero and linear like L1 beyond a threshold δ—less sensitive to outliers than pure MSE while remaining differentiable everywhere, including at the join point (the two pieces are matched to first order).
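A sketch of the piecewise definition (PyTorch assumed; δ = 1 is a common default):

```python
import torch

def huber(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond (shifted to stay continuous)."""
    abs_r = residual.abs()
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return torch.where(abs_r <= delta, quadratic, linear).mean()

r = torch.tensor([0.5, 3.0])
print(huber(r))  # (0.125 + 2.5) / 2 = 1.3125; torch.nn.HuberLoss matches
```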
8. Where does L2 regularization appear in the loss? (Easy)
Answer: Add λ||w||² (or similar) to the empirical loss so optimization shrinks weights, improving generalization. In the objective this is weight decay (AdamW instead decouples the decay from the gradient step, so the implementations differ).
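A sketch of the regularizer entering the objective directly (PyTorch assumed; in practice biases are often excluded from the penalty):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
lam = 1e-3  # regularization strength (illustrative value)

data_loss = F.mse_loss(model(x), y)
l2 = sum((w ** 2).sum() for w in model.parameters())
loss = data_loss + lam * l2   # penalty added to the empirical loss
loss.backward()
```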
9. Why not train directly on classification accuracy? (Medium)
Answer: Accuracy is piecewise constant in logits—gradient is zero almost everywhere. Differentiable surrogates (CE) provide learning signal.
10. Focal loss—purpose in one sentence. (Hard)
Answer: Down-weights easy examples so training focuses on hard ones—useful with extreme class imbalance in detection settings.
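A binary sketch (PyTorch assumed; omits the class-balancing α term from the original paper, and `focal_loss` is an illustrative name):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss: (1 - p_t)**gamma scales per-example BCE."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)   # prob of the true class
    return ((1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([3.0, -3.0, 0.1])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))  # easy examples contribute almost nothing
```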
11. Class imbalance—common loss-side fixes? (Medium)
Answer: Class weights in CE, resampling, focal loss, or changing the evaluation metric. Mention that rebalancing affects calibration.
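The class-weight fix in one line (PyTorch assumed; the 5:1 weight is an illustrative choice):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(6, 2)
targets = torch.tensor([0, 0, 0, 0, 0, 1])   # imbalanced batch

# Up-weight the rare class inside the CE objective
weights = torch.tensor([1.0, 5.0])
loss = F.cross_entropy(logits, targets, weight=weights)
```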
12. Label smoothing—what does it change? (Hard)
Answer: Replace hard one-hot with a mixture with a uniform (or other) distribution so the model is not pushed to infinite confidence. Often improves calibration and regularization.
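The target-side change, sketched in PyTorch (framework assumed; `smooth_targets` is an illustrative helper):

```python
import torch
import torch.nn.functional as F

def smooth_targets(labels, num_classes, eps=0.1):
    """Mix one-hot targets with the uniform distribution."""
    one_hot = F.one_hot(labels, num_classes).float()
    return (1 - eps) * one_hot + eps / num_classes

print(smooth_targets(torch.tensor([2]), num_classes=4))
# tensor([[0.0250, 0.0250, 0.9250, 0.0250]])
# F.cross_entropy(..., label_smoothing=0.1) applies the same idea directly
```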
13. KL divergence as a loss component—when? (Hard)
Answer: When matching two distributions—e.g. knowledge distillation (student vs teacher softmax), variational objectives, or probabilistic models. KL(p‖q) measures the extra bits needed to encode samples from p using a code optimized for q.
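A distillation-style sketch (PyTorch assumed; the T² factor is the common convention for keeping gradient scale comparable across temperatures):

```python
import torch
import torch.nn.functional as F

def distill_kl(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    log_q = F.log_softmax(student_logits / T, dim=1)  # student log-probs
    p = F.softmax(teacher_logits / T, dim=1)          # teacher probs
    return F.kl_div(log_q, p, reduction="batchmean") * T * T

s = torch.randn(4, 10, requires_grad=True)
t = torch.randn(4, 10)
distill_kl(s, t).backward()
```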
14. Multi-label classification—typical loss? (Medium)
Answer: Independent sigmoid + binary CE per label (not softmax), because multiple labels can be active at once.
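The with-logits form is the numerically stable way to write this (PyTorch assumed):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5)                     # 5 labels, any subset can be on
targets = torch.tensor([[1., 0., 1., 0., 0.],
                        [0., 1., 1., 1., 0.]])

# Independent sigmoid + BCE per label; sigmoid is fused into the loss
loss = F.binary_cross_entropy_with_logits(logits, targets)
```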
15. How do you pick a loss for a new task? (Medium)
Answer: Match the output head and probabilistic story: regression → MSE/Huber; exclusive classes → softmax+CE; multi-label → sigmoid+BCE; ranking → pairwise/ranking losses. Align with business metric when possible.
Tie every loss answer to gradients and what is being optimized.

Quick review checklist

  • Empirical risk; MSE vs CE; softmax+CE gradient story.
  • Why not accuracy; multi-label vs multi-class losses.
  • Regularization in objective; label smoothing / focal at high level.