Weight Initialization

Before training starts, every weight and bias must be set to something. Poor initialization can saturate activations, shrink gradients to near zero, or blow up activations until the loss is NaN on the first step. Good initialization keeps the scale of activations and gradients roughly stable across layers at the start, so optimization has a fighting chance. Modern defaults (Xavier/Glorot for tanh/sigmoid, He for ReLU) encode simple variance rules based on fan-in and fan-out.

Why Initialization Scale Matters

Each layer applies a weighted sum of its inputs. If weights are too large, pre-activations grow with depth; sigmoids and tanh saturate, ReLU outputs grow unchecked, and gradients can explode or vanish depending on the activation's derivative. If weights are too small, signals decay layer by layer and gradients vanish because downstream units see only tiny variations, so the network barely learns.

Initialization schemes aim for unit variance (order 1) of pre-activations at the beginning of training, under simplifying assumptions about input distribution and linearity. They do not replace training; they put parameters in a reasonable part of parameter space so SGD/Adam can work without extreme learning-rate hacks.
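
A quick way to see the scale effect (the widths and batch size here are arbitrary, not from any reference): push unit-variance inputs through one wide linear map under two weight scales and compare the resulting standard deviations.

import torch

fan_in = 4096
x = torch.randn(1024, fan_in)                        # unit-variance inputs

w_raw = torch.randn(fan_in, 256)                     # unscaled Normal(0, 1)
w_scaled = torch.randn(fan_in, 256) / fan_in ** 0.5  # variance 1 / fan_in

print((x @ w_raw).std().item())     # around sqrt(fan_in) = 64: far too large
print((x @ w_scaled).std().item())  # around 1: the scale these schemes aim for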

Biases. Often initialized to zero (or small constants). The main subtlety is weights; biases shift decision thresholds and are usually less sensitive.

Xavier / Glorot Initialization

Glorot initialization (often called Xavier) targets layers followed by symmetric, saturating activations such as tanh or sigmoid (its original setting). It sets the weight variance to 2 / (fan_in + fan_out), balancing preservation of activation variance in the forward pass against preservation of gradient variance in the backward pass, under a linear approximation. The uniform variant draws weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)); the normal variant uses a standard deviation of sqrt(2 / (fan_in + fan_out)).

PyTorch’s torch.nn.init.xavier_uniform_ and xavier_normal_ implement these rules. For sigmoid/tanh MLPs without batch norm, Xavier remains a standard teaching reference—though ReLU-dominated vision models more often use He.
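
As a rough check of the uniform rule (the layer sizes here are arbitrary), the bound that xavier_uniform_ uses can be recomputed by hand from fan-in and fan-out.

import math
import torch.nn as nn

layer = nn.Linear(256, 128)                # fan_in = 256, fan_out = 128
nn.init.xavier_uniform_(layer.weight)

bound = math.sqrt(6.0 / (256 + 128))       # Glorot uniform limit
print(bound)
print(layer.weight.abs().max().item())     # should not exceed the bound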

He Initialization (ReLU)

He initialization accounts for ReLU zeroing out roughly half of its input distribution: the weight variance is set to 2 / fan_in (standard deviation sqrt(2 / fan_in) for the normal variant, bound sqrt(6 / fan_in) for the uniform variant), so the expected variance of activations after ReLU stays in a sensible range. This is the default family for many Conv2d and Linear modules in PyTorch (Kaiming uniform/normal).
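
A similar sanity check for the normal variant (layer size arbitrary): the empirical standard deviation of a kaiming_normal_ weight matrix should sit near sqrt(2 / fan_in).

import math
import torch.nn as nn

layer = nn.Linear(256, 128)                                 # fan_in = 256
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

print(math.sqrt(2.0 / 256))                # target std from the 2 / fan_in rule
print(layer.weight.std().item())           # empirical std, should be close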

If you stack many ReLU layers without normalization, He init plus reasonable learning rate is a better starting point than Xavier, which assumed different activation statistics.
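
A minimal sketch of that claim, with an arbitrary 20-layer, 1024-wide ReLU stack and no normalization: He-scaled weights keep the activation scale roughly constant, while the smaller Xavier scale lets it shrink layer by layer.

import torch

depth, width = 20, 1024
x_he = torch.randn(512, width)
x_xavier = x_he.clone()

for _ in range(depth):
    w_he = torch.randn(width, width) * (2.0 / width) ** 0.5      # Var = 2 / fan_in
    w_xavier = torch.randn(width, width) * (1.0 / width) ** 0.5  # Var = 2 / (fan_in + fan_out)
    x_he = torch.relu(x_he @ w_he)
    x_xavier = torch.relu(x_xavier @ w_xavier)

print(x_he.std().item())       # stays on the order of the input scale
print(x_xavier.std().item())   # collapses toward zero after 20 layers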

Rule of thumb. ReLU/Leaky ReLU family → Kaiming/He; tanh/sigmoid hidden layers → Glorot/Xavier; custom ops → read the paper or match framework defaults for that module type.

What Usually Fails

All zeros for weights: symmetry is never broken; every hidden unit in a layer computes the same output and receives the same gradient, so learning stalls. The same large constant everywhere: similar symmetry plus saturation issues. Unscaled random Normal(0, 1) in a 4096-wide layer: enormous pre-activations. Always tie the scale to the layer width.
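
A small illustration of the symmetry problem (sizes and data are made up): with every parameter set to the same constant, all hidden units compute the same thing and their gradient rows come out identical, so a gradient step can never pull them apart.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)              # deliberately bad: every parameter identical

x = torch.randn(16, 4)
y = torch.randn(16, 1)
loss = (net(x) - y).pow(2).mean()
loss.backward()

print(net[0].weight.grad)                  # every row is identical: the units stay tied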

For output layers, some recipes use small random values or zero the final layer (common in residual networks and specialized heads); follow the established recipe for the architecture you are copying.

PyTorch: Defaults and Overrides

Explicit Kaiming & Xavier
import torch
import torch.nn as nn

m = nn.Linear(256, 128)
# Default is often Kaiming uniform for Linear
nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
nn.init.zeros_(m.bias)

# Xavier for a tanh-style stack
m2 = nn.Linear(256, 128)
nn.init.xavier_uniform_(m2.weight)
nn.init.zeros_(m2.bias)

When you define custom modules, call init after creating parameters or provide a reset_parameters method. Transfer learning skips init for loaded weights; only newly added heads need it.
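
A minimal sketch of both points (module and layer sizes are placeholders, not from any particular recipe): a custom block that initializes itself in reset_parameters, and a new classification head re-initialized on top of backbone weights that would normally be loaded from a checkpoint.

import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.reset_parameters()

    def reset_parameters(self):
        # He init, since a ReLU follows this layer
        nn.init.kaiming_normal_(self.fc.weight, nonlinearity='relu')
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        return torch.relu(self.fc(x))

backbone = nn.Sequential(MLPBlock(512, 256), MLPBlock(256, 128))
# backbone.load_state_dict(...) would go here; loaded weights stay untouched
head = nn.Linear(128, 10)                  # only the new head gets fresh init
nn.init.xavier_uniform_(head.weight)
nn.init.zeros_(head.bias)
model = nn.Sequential(backbone, head)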

Summary

  • Initialization sets the starting signal/gradient scale; bad scale makes training unstable.
  • Xavier/Glorot suits symmetric saturating activations; He/Kaiming suits ReLU.
  • Variance scales with fan-in (and sometimes fan-out) — never use unscaled large Gaussians in wide layers.
  • Use framework defaults unless you have a reason to override them; match the nonlinearity argument in Kaiming init to your activation.