Weight Initialisation

Random, but with exactly the right variance

01 · First principles: Why the starting point matters at all

Training is local search from a starting point. Before the first gradient step, the initial weights determine two things: whether different neurons can become different, and whether signal survives the trip through depth at all. Both can be ruined before any data is seen, and ruined in ways gradient descent cannot repair. Initialisation is the art of not losing the game on move zero.

02 · First failure: All zeros, symmetry never breaks

The tempting "neutral" start — all weights zero — is the one provably broken choice. If every neuron in a layer has identical incoming weights, each computes the same output, therefore receives the same backpropagated gradient, therefore gets the same update. By induction the neurons stay identical forever: a layer of 4096 units carries exactly one unit's worth of features. Gradient descent preserves the symmetry; it cannot break it.

Hence randomness is not noise-for-its-own-sake. It is the symmetry breaker. The only remaining question is the scale.
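The symmetry argument can be checked numerically. The sketch below is a toy illustration rather than a reference implementation: the tiny tanh network, the constant value 0.3, the squared-error loss, and the helper name `hidden_grads` are all assumptions chosen for brevity. It compares the hidden-layer gradient under an identical-weights start with the same gradient under a random start.

```python
import numpy as np

def hidden_grads(W1, w2, X, y):
    """Gradient of 0.5 * mean squared error w.r.t. W1 for a tiny tanh MLP."""
    H = np.tanh(X @ W1)                      # hidden activations, (batch, hidden)
    err = H @ w2 - y                         # prediction error, (batch,)
    # Backprop through the output weights, then through tanh' = 1 - H^2.
    dH = np.outer(err, w2) * (1.0 - H**2)    # (batch, hidden)
    return X.T @ dH / len(y)                 # (inputs, hidden)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)

# Identical start: every hidden unit has the same incoming and outgoing weights.
g_const = hidden_grads(np.full((5, 4), 0.3), np.full(4, 0.3), X, y)

# Random start: every hidden unit gets its own weights.
g_rand = hidden_grads(rng.normal(scale=0.3, size=(5, 4)),
                      rng.normal(scale=0.3, size=4), X, y)

# One gradient column per hidden unit: compare each column with the first.
print("identical init -> identical gradients:", np.allclose(g_const, g_const[:, :1]))
print("random init    -> identical gradients:", np.allclose(g_rand, g_rand[:, :1]))
```

With identical columns in, the update has identical columns out, so one step later the units are still clones. With literal zeros in this particular network the hidden-layer gradient is exactly zero as well, because the error is multiplied by zero output weights on the way back.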

03 · Second failure: Wrong scale compounds exponentially

A depth-n network multiplies the signal by n weight matrices. If each layer multiplies the typical activation magnitude by a factor c, depth delivers cⁿ. There is no gentle failure: c = 0.9 over 50 layers gives about 0.005 (activations and gradients vanish into numerical dust); c = 1.1 gives about 117 (they explode). The same geometric compounding applies to gradients on the way back. Only c ≈ 1 survives depth: wrong scale is not slow training, it is exponential decay or blow-up of the learning signal itself (the same product-of-factors arithmetic that makes gradients vanish through saturating activations).

Figure: Per-layer gain compounds geometrically with depth. The only line that survives 50 layers is the flat one.
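The arithmetic is easy to reproduce. The sketch below first evaluates c⁵⁰ directly, then pushes a random vector through 50 random linear layers whose weight standard deviation is c/√width, which gives each layer a gain of roughly c. The width of 256, the purely linear layers, and the helper name `rms_after_depth` are assumptions for illustration.

```python
import numpy as np

depth = 50
for c in (0.9, 1.0, 1.1):
    # Pure arithmetic: about 0.005 for c = 0.9 and about 117 for c = 1.1.
    print(f"c = {c}: c**{depth} = {c**depth:.3g}")

def rms_after_depth(c, depth=50, width=256, seed=0):
    """RMS activation after `depth` linear layers with per-layer gain of about c."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width)               # input with RMS close to 1
    for _ in range(depth):
        # std = c / sqrt(width) makes each layer scale the RMS by roughly c.
        W = rng.normal(scale=c / np.sqrt(width), size=(width, width))
        x = W @ x
    return float(np.sqrt(np.mean(x**2)))

rms = {c: rms_after_depth(c) for c in (0.9, 1.0, 1.1)}
for c, value in rms.items():
    print(f"simulated RMS after {depth} layers, c = {c}: {value:.3g}")
```

The simulated values carry some random fluctuation from the finite width, but they sit orders of magnitude apart, in line with the cⁿ prediction.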

04 · The fix: Preserve variance with Xavier and He

So the design principle writes itself: choose the weight variance so each layer leaves activation variance unchanged. The whole derivation is one line. For a unit y = Σ_i w_i x_i with fan-in n, where the inputs x_i and weights w_i are mutually independent and both zero-mean, each product contributes Var(w)·Var(x) and the n contributions add:

Var(y) = n · Var(w) · Var(x) ⇒ Var(y) = Var(x) requires Var(w) = 1/n
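A quick Monte Carlo check of this identity, as a sketch under the same assumptions as the derivation (independent, zero-mean Gaussian weights and inputs). The fan-in of 256, the input variance of 1.7, and the helper name `output_variance` are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 256, 4000
var_x = 1.7                                  # arbitrary input variance (assumption)

def output_variance(var_w):
    """Empirical Var(y) for y = sum_i w_i * x_i, with fresh weights per sample."""
    x = rng.normal(scale=np.sqrt(var_x), size=(samples, n))
    w = rng.normal(scale=np.sqrt(var_w), size=(samples, n))
    y = np.sum(w * x, axis=1)
    return float(y.var())

var_y_unit = output_variance(1.0 / n)        # Var(w) = 1/n should preserve Var(x)
var_y_other = output_variance(0.01)          # any other choice rescales by n * Var(w)

print(f"Var(w) = 1/n : Var(y) = {var_y_unit:.3f}  (prediction {var_x})")
print(f"Var(w) = 0.01: Var(y) = {var_y_other:.3f}  (prediction {n * 0.01 * var_x:.3f})")
```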

That is Xavier/Glorot initialisation, the right answer when the nonlinearity behaves like the identity near zero (tanh does). The practical form uses the average of fan-in and fan-out, Var(w) = 2/(n_in + n_out), to balance the backward pass too. He initialisation is the same argument with one correction: ReLU zeroes the negative half of a symmetric input, which halves the second moment E[x²] that the next layer's variance depends on. Compensate by doubling:

Var(w) = 2/n_fan-in (He, for ReLU families)
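The factor of two can be seen directly by stacking ReLU layers. The following is a minimal sketch, assuming fully connected layers of equal width, Gaussian weights, and no biases; `relu_forward_rms` is a hypothetical helper, and `var_scale` is the numerator in Var(w) = var_scale / fan-in.

```python
import numpy as np

def relu_forward_rms(var_scale, depth=30, width=512, batch=64, seed=0):
    """RMS of activations after `depth` ReLU layers with Var(w) = var_scale / fan_in."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(var_scale / width), size=(width, width))
        x = np.maximum(x @ W, 0.0)           # linear layer followed by ReLU
    return float(np.sqrt(np.mean(x**2)))

xavier_rms = relu_forward_rms(1.0)           # Var(w) = 1/fan_in
he_rms = relu_forward_rms(2.0)               # Var(w) = 2/fan_in

print(f"1/fan_in under ReLU: RMS after 30 layers = {xavier_rms:.3g}")
print(f"2/fan_in under ReLU: RMS after 30 layers = {he_rms:.3g}")
```

With the 1/fan-in scale, each ReLU layer halves the second moment, so the RMS shrinks by about 1/√2 per layer and collapses over 30 layers. With the 2/fan-in scale it stays of order one.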

The lesson generalises beyond the two formulas: every activation function implies its own correction factor, computed by asking what the nonlinearity does to the variance of its input.
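One way to carry out that calculation numerically is sketched below. It assumes the pre-activation z is standard normal and estimates the factor as 1 / E[f(z)²], so that Var(w) = factor / fan-in keeps the second moment steady. The helper name `variance_gain` and the leaky-ReLU slope of 0.1 are illustrative assumptions.

```python
import numpy as np

def variance_gain(activation, samples=1_000_000, seed=0):
    """Monte Carlo estimate of 1 / E[f(z)^2] for z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=samples)
    return float(1.0 / np.mean(activation(z) ** 2))

gains = {
    "identity": variance_gain(lambda z: z),
    "relu": variance_gain(lambda z: np.maximum(z, 0.0)),
    "leaky_relu(0.1)": variance_gain(lambda z: np.where(z > 0, z, 0.1 * z)),
}

for name, gain in gains.items():
    print(f"{name:>16}: Var(w) ≈ {gain:.3f} / fan_in")
```

The identity recovers the factor 1 and ReLU recovers the factor 2. For a leaky ReLU with negative slope a, the closed form under the same assumption is 2/(1 + a²), slightly below 2.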

05 · Modern context: Forgiving nets, but the start still steers

Two architectural inventions made networks far less brittle to all of this. Residual connections give the signal an identity path that bypasses the multiplicative gauntlet, so depth no longer compounds gain by default. Normalisation layers actively rescale activations every forward pass, correcting scale errors instead of letting them compound. Together they turned initialisation from a make-or-break setting into a sane default (He or Xavier, picked by activation) that rarely needs thought.
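The effect of normalisation on a badly scaled stack can be sketched in a few lines. This is a toy illustration, assuming linear layers with a deliberately wrong per-layer gain of 1.5 and a parameter-free, layer-norm-style rescale; the helper names `rms` and `normalise` are hypothetical, and real normalisation layers also carry learned scale and shift parameters.

```python
import numpy as np

def rms(x):
    return float(np.sqrt(np.mean(x**2)))

def normalise(x, eps=1e-6):
    """Rescale a vector to zero mean and unit variance (no learned parameters)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
width, depth, c = 256, 50, 1.5               # c = 1.5 is a deliberately bad gain
x_plain = x_norm = rng.normal(size=width)

for _ in range(depth):
    W = rng.normal(scale=c / np.sqrt(width), size=(width, width))
    x_plain = W @ x_plain                    # scale error compounds layer after layer
    x_norm = normalise(W @ x_norm)           # scale error is corrected at every layer

print(f"without normalisation: RMS = {rms(x_plain):.3g}")
print(f"with normalisation:    RMS = {rms(x_norm):.3g}")
```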

Yet it still earns a note. The early trajectory — which basin the optimiser drifts toward, how stable the first thousand steps are, whether large-LR warmup survives — is set by the init. Modern recipes still carry deliberate init choices: zero-init of the final layer of each residual block (start the network near the identity), depth-scaled inits in large transformers, and the small-but-not-too-small embedding scales that make warmup behave. The problem was demoted, not solved.
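The zero-init recipe is simple to illustrate. The sketch below assumes a two-layer residual branch with ReLU, no normalisation, and a hypothetical `ResidualBlock` class; the 1/width variance used for the non-zero alternative is an arbitrary choice for comparison, not a recommendation.

```python
import numpy as np

class ResidualBlock:
    """Computes x + W2 @ relu(W1 @ x); optionally starts with W2 = 0."""

    def __init__(self, width, rng, zero_last=True):
        # First layer: He initialisation, Var(w) = 2 / fan_in.
        self.W1 = rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))
        if zero_last:
            self.W2 = np.zeros((width, width))   # branch contributes nothing at init
        else:
            self.W2 = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))

    def __call__(self, x):
        return x + self.W2 @ np.maximum(self.W1 @ x, 0.0)

def rms(x):
    return float(np.sqrt(np.mean(x**2)))

rng = np.random.default_rng(0)
width, depth = 128, 40
x0 = rng.normal(size=width)

def run(zero_last):
    x = x0
    for _ in range(depth):
        x = ResidualBlock(width, rng, zero_last)(x)
    return x

out_zero = run(zero_last=True)
out_rand = run(zero_last=False)

print("zero-init stack is the identity:", np.array_equal(out_zero, x0))
print(f"input RMS = {rms(x0):.3g}, random-last-layer output RMS = {rms(out_rand):.3g}")
```

In this toy setup the zero-initialised stack returns its input unchanged at step zero, whatever the depth, while the randomly initialised branches each add their own variance to the stream, so the output scale grows with the number of blocks.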

Mental Model

Zeros fail by symmetry: identical neurons get identical gradients and remain clones forever — randomness is the symmetry breaker.
Scale fails geometrically: per-layer gain c gives c^depth; anything but c ≈ 1 destroys signal or gradients exponentially.
The one-line rule: Var(y) = n·Var(w)·Var(x), so set Var(w) = 1/fan-in (Xavier, tanh-like) or 2/fan-in (He, ReLU halves variance).
Residual paths and norm layers made depth forgiving; init now mostly sets the early trajectory rather than feasibility.
New activation? Re-derive the factor: ask what it does to the variance passing through it.