Loss Functions: The Compass of Neural Networks
Loss functions quantify the error between predictions and true targets; they guide gradient descent and define what the model optimizes. This reference covers regression, classification, probabilistic, and specialized losses, with math and code.
- Regression: MSE, MAE, Huber
- Classification: Cross-Entropy, Hinge
- Probabilistic: KL Divergence
- Advanced: CTC, Contrastive
What is a Loss Function?
A loss function (also called cost/objective function) maps the model's predictions and ground truth to a scalar value. Lower loss = better predictions. During training, backpropagation computes gradients of the loss w.r.t. weights, and optimizers update weights to minimize this loss.
Loss functions define the learning objective.
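As a toy sketch of this loop (values here are illustrative), the MSE loss and its gradient with respect to the predictions can be computed directly:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 2.0])

# Scalar loss value the optimizer will minimize
loss = np.mean((y_true - y_pred) ** 2)

# Gradient of MSE w.r.t. each prediction: (2/n) * (yhat - y)
grad = 2 * (y_pred - y_true) / len(y_true)
```

Backpropagation extends this gradient from the predictions back through every weight via the chain rule.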
Regression Losses: Predicting Continuous Values
MSE (L2 Loss)
MSE = 1/n Σ(y - ŷ)²
```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
```
Differentiable everywhere. Sensitive to outliers. The most common regression loss.
MAE (L1 Loss)
MAE = 1/n Σ|y - ŷ|
```python
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
```
Robust to outliers. Not differentiable at 0.
Huber Loss
L_δ = ½(y-ŷ)² if |y-ŷ| ≤ δ, else δ|y-ŷ| - ½δ²
```python
def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    return np.mean(is_small * 0.5 * error**2 +
                   ~is_small * (delta * np.abs(error) - 0.5 * delta**2))
```
Combines MSE and MAE. Smooth, robust.
Log-Cosh Loss
L = log(cosh(ŷ - y))
Behaves like MSE for small errors and like MAE for large ones; twice differentiable everywhere.
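A direct `log(cosh(x))` overflows for large errors; a common numerically stable sketch uses `logaddexp`:

```python
import numpy as np

def log_cosh(y_true, y_pred):
    x = y_pred - y_true
    # log(cosh(x)) = logaddexp(x, -x) - log(2), stable for large |x|
    return np.mean(np.logaddexp(x, -x) - np.log(2.0))
```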
Quantile Loss
L = Σ max(q(y-ŷ), (q-1)(y-ŷ))
Used for quantile regression and prediction intervals.
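A minimal NumPy sketch of the pinball (quantile) loss, where `q` is the target quantile:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q=0.9):
    e = y_true - y_pred
    # Under-prediction (e > 0) costs q*e; over-prediction costs (1-q)*|e|
    return np.mean(np.maximum(q * e, (q - 1) * e))
```

With q = 0.9 the minimizer is the 90th percentile: under-predicting is nine times as expensive as over-predicting.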
Classification Losses: Probability & Decision Boundaries
Binary Cross-Entropy (BCE)
BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]
```python
def binary_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # numerical stability
    return -np.mean(y * np.log(y_pred) +
                    (1 - y) * np.log(1 - y_pred))
```
Use: Binary classification, sigmoid output.
Categorical Cross-Entropy (CCE)
CCE = -Σ y_i log(ŷ_i)
```python
def categorical_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.sum(y * np.log(y_pred)) / y.shape[0]
```
Use: Multi-class, softmax output.
Sparse CCE
Same as CCE but targets are integers (not one-hot). Memory efficient.
tf.keras.losses.SparseCategoricalCrossentropy()
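The same idea in plain NumPy, a sketch where targets are integer class indices rather than one-hot vectors:

```python
import numpy as np

def sparse_cce(y_int, y_pred):
    # y_int: shape (n,) integer class labels; y_pred: shape (n, classes) softmax output
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    # Pick the predicted probability of the true class for each sample
    return -np.mean(np.log(y_pred[np.arange(len(y_int)), y_int]))
```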
Hinge Loss
L = max(0, 1 - y·ŷ) (y ∈ {-1,1})
Used in SVMs; occasionally as a classification loss for neural networks.
tf.keras.losses.Hinge()
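A NumPy sketch, assuming labels in {-1, +1} and raw (unsquashed) scores:

```python
import numpy as np

def hinge(y_true, y_pred):
    # Zero loss once the correct side of the margin (y * yhat >= 1) is reached
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))
```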
Squared Hinge
L = max(0, 1 - y·ŷ)²
Differentiable everywhere; penalizes large margin violations more heavily than plain hinge.
Tip: prefer logit-based losses (e.g. tf.keras.losses.BinaryCrossentropy(from_logits=True)), which combine the log and sigmoid/softmax in a numerically stable way.
Probabilistic Losses: Distributions & Divergence
KL Divergence
D_KL(P||Q) = Σ P(i) log(P(i)/Q(i))
Measures how one probability distribution diverges from another. Asymmetric.
```python
def kl_divergence(p, q):
    p = np.clip(p, 1e-7, 1)
    q = np.clip(q, 1e-7, 1)
    return np.sum(p * np.log(p / q))
```
Used in VAEs, variational inference.
JS Divergence
Jensen-Shannon divergence. Symmetric, smoothed version of KL.
Used in GANs, domain adaptation.
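A self-contained sketch: JS divergence is the average KL of each distribution to their midpoint, which makes it symmetric and bounded.

```python
import numpy as np

def js_divergence(p, q):
    p = np.clip(p, 1e-7, 1)
    q = np.clip(q, 1e-7, 1)
    m = 0.5 * (p + q)  # midpoint distribution
    kl = lambda a, b: np.sum(a * np.log(a / b))
    # Symmetric by construction: JS(P||Q) == JS(Q||P)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```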
Cross-Entropy vs KL
Cross-Entropy = H(P,Q) = H(P) + D_KL(P||Q). Minimizing cross-entropy is equivalent to minimizing KL when P is fixed (ground truth).
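The identity is easy to verify numerically for a small example distribution:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # fixed ground-truth distribution P
q = np.array([0.5, 0.3, 0.2])  # model distribution Q

cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)
entropy = -np.sum(p * np.log(p))        # H(P)
kl = np.sum(p * np.log(p / q))          # D_KL(P||Q)
# cross_entropy == entropy + kl, so with P fixed, minimizing H(P,Q) minimizes D_KL
```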
Advanced & Specialized Loss Functions
CTC Loss
Connectionist Temporal Classification. Used in speech recognition and handwriting recognition; it trains on unsegmented sequence pairs by summing over all possible alignments, so no per-frame alignment labels are needed.
tf.nn.ctc_loss
Contrastive Loss
L = y*d² + (1-y)*max(margin-d,0)²
Used in Siamese networks, similarity learning.
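A NumPy sketch, assuming `d` holds precomputed Euclidean distances between embedding pairs and `y` marks similar pairs with 1:

```python
import numpy as np

def contrastive_loss(y, d, margin=1.0):
    # Similar pairs (y=1): pull together; dissimilar pairs (y=0): push past the margin
    return np.mean(y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2)
```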
Triplet Loss
max(d(a,p)-d(a,n)+margin, 0)
Face recognition (FaceNet), embeddings.
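A batched NumPy sketch using squared Euclidean distances (as in FaceNet); `a`, `p`, `n` are anchor, positive, and negative embeddings:

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    d_ap = np.sum((a - p)**2, axis=1)  # anchor-positive squared distance
    d_an = np.sum((a - n)**2, axis=1)  # anchor-negative squared distance
    # Zero loss once the negative is at least `margin` farther than the positive
    return np.mean(np.maximum(d_ap - d_an + margin, 0.0))
```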
Dice Loss / F1 Score
1 - (2|X∩Y|)/(|X|+|Y|). For imbalanced segmentation, medical imaging.
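A soft Dice loss sketch on probability masks; the small `eps` (an added assumption) keeps empty masks from dividing by zero:

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    intersection = np.sum(y_true * y_pred)
    # 1 - Dice coefficient: 0 for perfect overlap, ~1 for disjoint masks
    return 1.0 - (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
```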
Perceptual Loss
Loss based on feature maps of pre-trained networks (VGG). For style transfer, super-resolution.
Loss Function Selection Guide
| Task Type | Recommended Loss | Output Activation | Comments |
|---|---|---|---|
| Regression (normal) | MSE | Linear | Sensitive to outliers |
| Regression (robust) | Huber / MAE | Linear | Less sensitive to outliers |
| Binary Classification | Binary Cross-Entropy | Sigmoid | Use from_logits for stability |
| Multi-class Classification | Categorical Cross-Entropy | Softmax | Use sparse CE for integer labels |
| Multi-label Classification | Binary Cross-Entropy | Sigmoid (per class) | Independent probabilities |
| Imbalanced Data | Weighted CE / Focal Loss | Sigmoid/Softmax | Focuses on hard samples |
| Similarity Learning | Contrastive / Triplet | L2 normalized | Embedding space |
| Generative Models | BCE (GANs), KL (VAEs) | Varies | Task specific |
Quick Selection Rules:
- Regression: Start with MSE. If outliers are problematic, try MAE or Huber.
- Binary classification: Binary cross-entropy.
- Multi-class: Categorical cross-entropy.
- Probabilistic outputs: KL Divergence.
- Sequence alignment: CTC Loss.
Loss Functions in TensorFlow & PyTorch
TensorFlow / Keras
```python
import tensorflow as tf

# Common built-in losses
model.compile(loss='mse', optimizer='adam')  # regression
model.compile(loss='binary_crossentropy', ...)
model.compile(loss='categorical_crossentropy', ...)
model.compile(loss=tf.keras.losses.Huber(delta=1.5), ...)

# Custom loss function
def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))
```
PyTorch
```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()            # regression
criterion = nn.BCELoss()            # expects sigmoid probabilities
criterion = nn.BCEWithLogitsLoss()  # stable, takes raw logits
criterion = nn.CrossEntropyLoss()   # applies log-softmax internally; pass raw logits
criterion = nn.KLDivLoss()          # KL divergence; expects log-probabilities as input

# Custom loss as a module
class CustomLoss(nn.Module):
    def forward(self, y_pred, y_true):
        return torch.mean((y_true - y_pred)**2)
```
Designing Custom Loss Functions
Sometimes you need a task-specific loss. Any differentiable function that maps (y_true, y_pred) to a scalar can be a loss.
```python
def weighted_mse(y_true, y_pred):
    # Double the penalty where the target is positive
    weights = tf.where(y_true > 0.5, 2.0, 1.0)
    return tf.reduce_mean(weights * (y_true - y_pred)**2)

model.compile(loss=weighted_mse, optimizer='adam')
```
```python
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, y_pred, y_true):
        # Per-element BCE, then down-weight easy examples by (1 - p_t)^gamma
        bce = nn.functional.binary_cross_entropy_with_logits(y_pred, y_true, reduction='none')
        p = torch.sigmoid(y_pred)
        p_t = p * y_true + (1 - p) * (1 - y_true)  # probability of the true class
        focal = self.alpha * (1 - p_t)**self.gamma * bce
        return focal.mean()
```
Loss Function Pitfalls & Best Practices
Loss Landscape: The shape of the loss surface affects optimization. MSE is convex in the model output, and cross-entropy is convex for linear models, but with deep neural networks the overall loss landscape is non-convex.