Loss Functions: The Compass of Neural Networks
Loss functions quantify the error between predictions and true targets; they guide gradient descent and define what the model optimizes. This reference covers regression, classification, probabilistic, and specialized losses, with math and code.
- Regression: MSE, MAE, Huber
- Classification: Cross-Entropy, Hinge
- Probabilistic: KL Divergence
- Advanced: CTC, Contrastive
What is a Loss Function?
A loss function (also called cost/objective function) maps the model's predictions and ground truth to a scalar value. Lower loss = better predictions. During training, backpropagation computes gradients of the loss w.r.t. weights, and optimizers update weights to minimize this loss.
Loss functions define the learning objective.
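As a toy sketch of this loop (values here are illustrative), the MSE loss and its gradient with respect to the predictions can be computed directly:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 2.0])

# Scalar loss value the optimizer will minimize
loss = np.mean((y_true - y_pred) ** 2)

# Gradient of MSE w.r.t. each prediction: (2/n) * (yhat - y)
grad = 2 * (y_pred - y_true) / len(y_true)
```

Backpropagation extends this gradient from the predictions back through every weight via the chain rule.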
Regression Losses: Predicting Continuous Values
MSE (L2 Loss)
MSE = 1/n Σ(y - ŷ)²
```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
```
Differentiable everywhere. Sensitive to outliers. The most common regression loss.
MAE (L1 Loss)
MAE = 1/n Σ|y - ŷ|
```python
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
```
Robust to outliers. Not differentiable at 0.
Huber Loss
L_δ = ½(y-ŷ)² if |y-ŷ| ≤ δ, else δ|y-ŷ| - ½δ²
```python
def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    return np.mean(is_small * 0.5 * error**2 +
                   ~is_small * (delta * np.abs(error) - 0.5 * delta**2))
```
Combines MSE and MAE. Smooth, robust.
Log-Cosh Loss
L = log(cosh(ŷ - y))
Behaves like MSE for small errors and like MAE for large ones; twice differentiable everywhere.
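A direct `log(cosh(x))` overflows for large errors; a common numerically stable sketch uses `logaddexp`:

```python
import numpy as np

def log_cosh(y_true, y_pred):
    x = y_pred - y_true
    # log(cosh(x)) = logaddexp(x, -x) - log(2), stable for large |x|
    return np.mean(np.logaddexp(x, -x) - np.log(2.0))
```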
Quantile Loss
L = Σ max(q(y-ŷ), (q-1)(y-ŷ))
Used for quantile regression and prediction intervals.
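A minimal NumPy sketch of the pinball (quantile) loss, where `q` is the target quantile:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q=0.9):
    e = y_true - y_pred
    # Under-prediction (e > 0) costs q*e; over-prediction costs (1-q)*|e|
    return np.mean(np.maximum(q * e, (q - 1) * e))
```

With q = 0.9 the minimizer is the 90th percentile: under-predicting is nine times as expensive as over-predicting.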
Classification Losses: Probability & Decision Boundaries
Binary Cross-Entropy (BCE)
BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]
```python
def binary_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # numerical stability
    return -np.mean(y * np.log(y_pred) +
                    (1 - y) * np.log(1 - y_pred))
```
Use: Binary classification, sigmoid output.
Categorical Cross-Entropy (CCE)
CCE = -Σ y_i log(ŷ_i)
```python
def categorical_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.sum(y * np.log(y_pred)) / y.shape[0]
```
Use: Multi-class, softmax output.
Sparse CCE
Same as CCE but targets are integers (not one-hot). Memory efficient.
tf.keras.losses.SparseCategoricalCrossentropy()
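The same idea in plain NumPy, a sketch where targets are integer class indices rather than one-hot vectors:

```python
import numpy as np

def sparse_cce(y_int, y_pred):
    # y_int: shape (n,) integer class labels; y_pred: shape (n, classes) softmax output
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    # Pick the predicted probability of the true class for each sample
    return -np.mean(np.log(y_pred[np.arange(len(y_int)), y_int]))
```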
Hinge Loss
L = max(0, 1 - y·ŷ) (y ∈ {-1,1})
Used in SVMs; occasionally as a classification loss for neural networks.
tf.keras.losses.Hinge()
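A NumPy sketch, assuming labels in {-1, +1} and raw (unsquashed) scores:

```python
import numpy as np

def hinge(y_true, y_pred):
    # Zero loss once the correct side of the margin (y * yhat >= 1) is reached
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))
```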
Squared Hinge
L = max(0, 1 - y·ŷ)²
Differentiable everywhere; penalizes large margin violations more heavily than plain hinge.
Tip: prefer logit-based losses (e.g. tf.keras.losses.BinaryCrossentropy(from_logits=True)), which combine the log and sigmoid/softmax in a numerically stable way.
Probabilistic Losses: Distributions & Divergence
KL Divergence
D_KL(P||Q) = Σ P(i) log(P(i)/Q(i))
Measures how one probability distribution diverges from another. Asymmetric.
```python
def kl_divergence(p, q):
    p = np.clip(p, 1e-7, 1)
    q = np.clip(q, 1e-7, 1)
    return np.sum(p * np.log(p / q))
```
Used in VAEs, variational inference.
JS Divergence
Jensen-Shannon divergence. Symmetric, smoothed version of KL.
Used in GANs, domain adaptation.
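A self-contained sketch: JS divergence is the average KL of each distribution to their midpoint, which makes it symmetric and bounded.

```python
import numpy as np

def js_divergence(p, q):
    p = np.clip(p, 1e-7, 1)
    q = np.clip(q, 1e-7, 1)
    m = 0.5 * (p + q)  # midpoint distribution
    kl = lambda a, b: np.sum(a * np.log(a / b))
    # Symmetric by construction: JS(P||Q) == JS(Q||P)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```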
Cross-Entropy vs KL
Cross-Entropy = H(P,Q) = H(P) + D_KL(P||Q). Minimizing cross-entropy is equivalent to minimizing KL when P is fixed (ground truth).
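The identity is easy to verify numerically for a small example distribution:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # fixed ground-truth distribution P
q = np.array([0.5, 0.3, 0.2])  # model distribution Q

cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)
entropy = -np.sum(p * np.log(p))        # H(P)
kl = np.sum(p * np.log(p / q))          # D_KL(P||Q)
# cross_entropy == entropy + kl, so with P fixed, minimizing H(P,Q) minimizes D_KL
```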
Advanced & Specialized Loss Functions
CTC Loss
Connectionist Temporal Classification. Used in speech recognition and handwriting recognition; it trains on unsegmented sequence pairs by summing over all possible alignments, so no per-frame alignment labels are needed.
tf.nn.ctc_loss
Contrastive Loss
L = y*d² + (1-y)*max(margin-d,0)²
Used in Siamese networks, similarity learning.
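A NumPy sketch, assuming `d` holds precomputed Euclidean distances between embedding pairs and `y` marks similar pairs with 1:

```python
import numpy as np

def contrastive_loss(y, d, margin=1.0):
    # Similar pairs (y=1): pull together; dissimilar pairs (y=0): push past the margin
    return np.mean(y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2)
```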
Triplet Loss
max(d(a,p)-d(a,n)+margin, 0)
Face recognition (FaceNet), embeddings.
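A batched NumPy sketch using squared Euclidean distances (as in FaceNet); `a`, `p`, `n` are anchor, positive, and negative embeddings:

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    d_ap = np.sum((a - p)**2, axis=1)  # anchor-positive squared distance
    d_an = np.sum((a - n)**2, axis=1)  # anchor-negative squared distance
    # Zero loss once the negative is at least `margin` farther than the positive
    return np.mean(np.maximum(d_ap - d_an + margin, 0.0))
```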
Dice Loss / F1 Score
1 - (2|X∩Y|)/(|X|+|Y|). For imbalanced segmentation, medical imaging.
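A soft Dice loss sketch on probability masks; the small `eps` (an added assumption) keeps empty masks from dividing by zero:

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    intersection = np.sum(y_true * y_pred)
    # 1 - Dice coefficient: 0 for perfect overlap, ~1 for disjoint masks
    return 1.0 - (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
```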
Perceptual Loss
Loss based on feature maps of pre-trained networks (VGG). For style transfer, super-resolution.
Loss Function Selection Guide
| Task Type | Recommended Loss | Output Activation | Comments |
|---|---|---|---|
| Regression (normal) | MSE | Linear | Sensitive to outliers |
| Regression (robust) | Huber / MAE | Linear | Less sensitive to outliers |
| Binary Classification | Binary Cross-Entropy | Sigmoid | Use from_logits for stability |
| Multi-class Classification | Categorical Cross-Entropy | Softmax | Use sparse CE for integer labels |
| Multi-label Classification | Binary Cross-Entropy | Sigmoid (per class) | Independent probabilities |
| Imbalanced Data | Weighted CE / Focal Loss | Sigmoid/Softmax | Focuses on hard samples |
| Similarity Learning | Contrastive / Triplet | L2 normalized | Embedding space |
| Generative Models | BCE (GANs), KL (VAEs) | Varies | Task specific |
Quick Selection Rules:
- Regression: Start with MSE. If outliers are problematic, try MAE or Huber.
- Binary classification: Binary cross-entropy.
- Multi-class: Categorical cross-entropy.
- Probabilistic outputs: KL Divergence.
- Sequence alignment: CTC Loss.
Loss Functions in TensorFlow & PyTorch
TensorFlow / Keras
```python
import tensorflow as tf

# Common built-in losses
model.compile(loss='mse', optimizer='adam')  # regression
model.compile(loss='binary_crossentropy', ...)
model.compile(loss='categorical_crossentropy', ...)
model.compile(loss=tf.keras.losses.Huber(delta=1.5), ...)

# Custom loss function
def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))
```
PyTorch
```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()            # regression
criterion = nn.BCELoss()            # expects sigmoid probabilities
criterion = nn.BCEWithLogitsLoss()  # stable, takes raw logits
criterion = nn.CrossEntropyLoss()   # applies log-softmax internally; pass raw logits
criterion = nn.KLDivLoss()          # KL divergence; expects log-probabilities as input

# Custom loss as a module
class CustomLoss(nn.Module):
    def forward(self, y_pred, y_true):
        return torch.mean((y_true - y_pred)**2)
```
Designing Custom Loss Functions
Sometimes you need a task-specific loss. Any differentiable function that maps (y_true, y_pred) to a scalar can be a loss.
```python
def weighted_mse(y_true, y_pred):
    # Double the penalty where the target is positive
    weights = tf.where(y_true > 0.5, 2.0, 1.0)
    return tf.reduce_mean(weights * (y_true - y_pred)**2)

model.compile(loss=weighted_mse, optimizer='adam')
```
```python
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, y_pred, y_true):
        # Per-element BCE, then down-weight easy examples by (1 - p_t)^gamma
        bce = nn.functional.binary_cross_entropy_with_logits(y_pred, y_true, reduction='none')
        p = torch.sigmoid(y_pred)
        p_t = p * y_true + (1 - p) * (1 - y_true)  # probability of the true class
        focal = self.alpha * (1 - p_t)**self.gamma * bce
        return focal.mean()
```
Loss Function Pitfalls & Best Practices
Loss Landscape: The shape of the loss surface affects optimization. MSE is convex in the model output, and cross-entropy is convex for linear models, but with deep neural networks the overall loss landscape is non-convex.