Loss Functions
20 Essential Q/A
DL Interview Prep
Deep Learning Loss Functions: 20 Interview Questions
Master MSE, MAE, Binary/Categorical Cross-Entropy, Hinge, Huber, Contrastive, Triplet, KL Divergence, CTC, and more: when to use each, gradient behavior, and robustness – concise, interview-ready answers.
MSE
Cross-Entropy
MAE
Hinge
KL Div
Huber
1
What is a loss function in deep learning?
⚡ Easy
Answer: A loss function (cost/objective) quantifies the error between model predictions and true targets. Training minimizes this loss via gradient descent. Choice of loss depends on task: regression (L1, L2), classification (cross-entropy), ranking (hinge), etc.
ℒ(ŷ, y): a measure of "how wrong" the model is.
2
Compare MSE and MAE. When to use each?
📊 Medium
Answer: MSE = mean((y-ŷ)²), MAE = mean(|y-ŷ|). MSE penalizes large errors more (squaring), so it is sensitive to outliers; MAE is robust to outliers. Use MSE when outliers are rare or large errors should be emphasized; use MAE when robustness matters. MSE gradient magnitude ∝ error; MAE gradient is constant (±1).
MSE: smooth, convex; sensitive to outliers
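A quick NumPy sketch (illustrative values, not from the original) showing how a single outlier dominates MSE but only shifts MAE linearly:

import numpy as np
y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last value is an outlier
y_pred = np.array([1.1, 2.1, 2.9, 3.0])
mse = np.mean((y_true - y_pred) ** 2)        # ~2352 – dominated by the outlier
mae = np.mean(np.abs(y_true - y_pred))       # ~24.3 – grows only linearly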
3
Why use cross-entropy for classification, not MSE?
🔥 Hard
Answer: Cross-entropy with softmax/sigmoid gives stronger gradients when prediction is wrong. MSE + sigmoid saturates quickly – vanishing gradient. CE is also probabilistic (minimizes KL divergence), directly optimizes log-likelihood. CE is convex in parameters for linear models.
Binary CE: -[y log(p) + (1-y) log(1-p)] vs MSE: (y-p)²
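For a single sigmoid unit p = σ(z), ∂BCE/∂z = (p − y) while ∂MSE/∂z = 2(p − y)·p·(1 − p); a small sketch (illustrative logit) shows the MSE gradient vanishing on a confidently wrong prediction:

import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
y, z = 1.0, -8.0                      # true label 1, logit confidently predicts class 0
p = sigmoid(z)                        # ~3.4e-4 (saturated sigmoid)
grad_bce = p - y                      # ≈ -1.0: strong corrective signal
grad_mse = 2 * (p - y) * p * (1 - p)  # ≈ -6.7e-4: nearly vanished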
4
Binary vs Categorical Cross-Entropy: difference?
⚡ Easy
Answer: Binary CE for 2 classes (single sigmoid output). Categorical CE for ≥3 classes (softmax output). For multi-label (multiple binary tasks), use binary CE per output.
5
What is Hinge loss? Where is it used?
📊 Medium
Answer: Hinge: max(0, 1 - y·ŷ) for y ∈ {-1,1}. Used in SVMs and max-margin classifiers. Encourages correct classification with a margin. Not differentiable at margin; subgradient used. Less common in deep nets but used in Siamese nets (contrastive hinge).
L = Σ max(0, 1 - y_i * f(x_i))
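A minimal NumPy version of the same formula (scores and labels are illustrative):

import numpy as np
def hinge_loss(scores, labels):
    # labels in {-1, +1}; zero loss only when correct with margin >= 1
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))
hinge_loss(np.array([2.0, 0.3, -1.5]), np.array([1, 1, -1]))   # ~0.233: only the 0.3 score is penalized (0.7)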
6
Explain Huber loss. When is it useful?
🔥 Hard
Answer: Huber loss = MSE for small error, MAE for large error (quadratic near zero, linear otherwise). Smooth, less sensitive to outliers than MSE, differentiable. Used in robust regression (e.g., object detection bounding boxes – Smooth L1 is similar).
# Smooth L1 (similar to Huber)
def smooth_l1(x):
    return 0.5 * x**2 if abs(x) < 1 else abs(x) - 0.5
7
KL Divergence vs Cross-Entropy: relation?
🔥 Hard
Answer: Cross-Entropy = H(p,q) = H(p) + KL(p||q). Minimizing cross-entropy is equivalent to minimizing KL divergence if p is fixed (target distribution). In VAEs, we minimize KL(q(z|x) || p(z)) to regularize latent space.
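A quick numeric check of H(p, q) = H(p) + KL(p||q) with made-up distributions:

import numpy as np
p = np.array([0.7, 0.2, 0.1])   # target distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution
cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))
np.isclose(cross_entropy, entropy + kl)   # True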
8
What are Contrastive and Triplet losses?
🔥 Hard
Answer: Contrastive: pulls positive pairs together, pushes negative apart (margin). Triplet: anchor, positive, negative; loss = max(0, d(a,p) - d(a,n) + margin). Used in face recognition (FaceNet), siamese networks, self-supervised learning (SimCLR).
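A minimal triplet-loss sketch with Euclidean distances (toy 2-D embeddings; the margin value is illustrative):

import numpy as np
def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = np.linalg.norm(anchor - positive)   # anchor–positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor–negative distance
    return max(0.0, d_ap - d_an + margin)
a, p, n = np.array([0.0, 1.0]), np.array([0.1, 0.9]), np.array([1.0, 0.0])
triplet_loss(a, p, n)   # 0.0 – negative is already farther than positive + margin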
9
What is Focal Loss? Where is it used?
🔥 Hard
Answer: Focal loss = -(1-p_t)^γ * log(p_t). Modifies cross-entropy to down-weight easy examples, focus on hard misclassified. Solves class imbalance in object detection (RetinaNet). γ=2 common.
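A binary focal-loss sketch (γ = 2, no α weighting) showing how easy examples are down-weighted:

import numpy as np
def focal_loss(p, y, gamma=2.0):
    # p: predicted probability of class 1; y in {0, 1}
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * np.log(p_t)
focal_loss(0.95, 1)   # ~1.3e-4 – easy example, almost no loss
focal_loss(0.20, 1)   # ~1.03 – hard example keeps most of its plain CE (~1.61)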
10
What is CTC loss? Why is it useful?
🔥 Hard
Answer: Connectionist Temporal Classification (CTC) aligns input sequences to output sequences without pre-alignment. Used in speech recognition, OCR. It sums probabilities over all possible alignments via dynamic programming.
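A minimal PyTorch usage sketch of nn.CTCLoss (random tensors; the shapes are illustrative only):

import torch
import torch.nn as nn
T, N, C, S = 50, 4, 20, 10                               # time steps, batch, classes (incl. blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=2)      # model output as log-probabilities, shape (T, N, C)
targets = torch.randint(1, C, (N, S))                    # labels 1..C-1 (index 0 reserved for blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)   # sums over alignments internally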
11
Heuristics: choose L1, L2, or Huber for regression?
📊 Medium
Answer: L2 (MSE): default, but outlier-sensitive. L1 (MAE): robust, but slower convergence. Huber: best of both – quadratic for small errors, linear for large. Smooth L1 used in detectors.
12
Why is cross-entropy always ≥ 0?
📊 Medium
Answer: Cross-entropy = -Σ p(x) log q(x). Since p(x) ≥ 0 and log q(x) ≤ 0 (because q(x) ≤ 1), every term -p(x) log q(x) ≥ 0, so the sum is non-negative. It is zero only when q assigns probability 1 to every outcome p puts mass on (e.g., a one-hot target predicted with probability 1).
13
Relation between perplexity and cross-entropy?
📊 Medium
Answer: Perplexity = 2^{H(p,q)} where H is cross-entropy (if using log base 2). It measures how "surprised" the model is. Lower perplexity = better language model.
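A tiny sketch: perplexity is exp of the average negative log-probability in nats (equivalently 2^H with log base 2); the probabilities below are made up:

import numpy as np
token_probs = np.array([0.2, 0.5, 0.1, 0.4])    # prob. the model assigned to each actual next token
cross_entropy = -np.mean(np.log(token_probs))   # in nats
perplexity = np.exp(cross_entropy)              # ~3.98: about as "surprised" as a uniform 4-way guess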
14
NLL vs Cross-Entropy – same?
⚡ Easy
Answer: For classification with one-hot targets, categorical cross-entropy = negative log-likelihood. NLL is just -log(p(y|x)). In PyTorch, `CrossEntropyLoss` = LogSoftmax + NLLLoss.
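A quick PyTorch check of that equivalence (random logits):

import torch
import torch.nn as nn
logits = torch.randn(3, 5)                    # batch of 3, 5 classes
target = torch.tensor([1, 0, 4])
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
torch.allclose(ce, nll)                       # True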
15
What is Dice loss? Where is it used?
🔥 Hard
Answer: Dice = 1 - (2|X∩Y|)/(|X|+|Y|). Differentiable approximation of IoU. Used in medical image segmentation, imbalanced data. Handles pixel-wise class imbalance well.
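A soft (differentiable) Dice-loss sketch for binary masks; the small eps is a common stabilizer, assumed here:

import numpy as np
def soft_dice_loss(pred, target, eps=1e-6):
    # pred: probabilities in [0, 1]; target: binary mask; both flattened
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
soft_dice_loss(np.array([0.9, 0.8, 0.1, 0.2]), np.array([1.0, 1.0, 0.0, 0.0]))   # ~0.15: high overlap, low loss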
16
Why use log in cross-entropy loss?
📊 Medium
Answer: Log converts multiplicative probabilities to additive; numerically stable. Also, maximizing likelihood = minimizing negative log-likelihood. Log loss heavily penalizes very wrong confident predictions.
17
Compare gradients of MSE and MAE.
📊 Medium
Answer: MSE gradient scales with the error; MAE gradient magnitude is constant (±1). MSE converges faster but is outlier-sensitive.
∂MSE/∂ŷ = 2(ŷ - y) ; ∂MAE/∂ŷ = sign(ŷ - y)
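A quick autograd check of both gradients (single prediction, illustrative values):

import torch
y_hat = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(1.0)
((y_hat - y) ** 2).backward()
print(y_hat.grad)          # 4.0 = 2(ŷ - y): scales with the error
y_hat.grad = None
(y_hat - y).abs().backward()
print(y_hat.grad)          # 1.0 = sign(ŷ - y): constant magnitude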
18
Loss function for ordinal regression?
🔥 Hard
Answer: CORAL (COnsistent RAnk Logits), which turns an ordinal target into K−1 cumulative binary classifications, or cumulative-link (threshold) models. Alternatively, treat it as plain regression with rounding, or use MSE/MAE if the scale is meaningful.
19
What is energy-based loss?
🔥 Hard
Answer: Energy-based models (EBMs) assign a scalar energy to each configuration. The loss is designed to push down the energy of correct configurations and pull up the energy of incorrect ones. Examples: contrastive loss and hinge-style margin losses for EBMs.
20
Designing a custom loss: key requirements?
🔥 Hard
Answer: Must be differentiable (at least almost everywhere), correlate with the evaluation metric, be numerically stable, and be efficient to compute. Also consider convexity (not strictly required) and gradient behavior.
Example: custom IoU loss, focal loss, Huber.
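A sketch of one such custom loss in PyTorch – a log-cosh regression loss (not from the original, just an illustration of the requirements above: differentiable, numerically stable, vectorized):

import math
import torch
import torch.nn.functional as F
def log_cosh_loss(pred, target):
    # log(cosh(x)): quadratic near zero, linear for large errors (Huber-like)
    x = pred - target
    # stable form: log(cosh(x)) = |x| + softplus(-2|x|) - log(2), avoids exp overflow
    return torch.mean(x.abs() + F.softplus(-2.0 * x.abs()) - math.log(2.0))
pred = torch.tensor([2.5, 0.0, 7.0], requires_grad=True)
target = torch.tensor([3.0, 0.0, 1.0])
log_cosh_loss(pred, target).backward()   # gradient is tanh(pred - target): smooth and bounded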
Loss Functions – Interview Cheat Sheet
Regression
- L2 (MSE) – sensitive to outliers
- L1 (MAE) – robust, constant gradient
- Huber – robust + smooth
- Smooth L1 – Huber-like, used in detectors
Classification
- Cross-Entropy – binary / categorical
- Hinge – max-margin
- Focal – class imbalance
Advanced
- KL Divergence – VAEs, distribution matching
- Contrastive – Siamese nets
- Triplet – face recognition
- CTC – sequence alignment
- Dice – segmentation
Outlier-robust
- MAE, Huber, Smooth L1
Verdict: "Task dictates loss – regression, classification, ranking, alignment."
20 loss Q/A covered