Random, but with exactly the right variance
Training is local search from a starting point. Before the first gradient step, the initial weights determine two things: whether different neurons can become different, and whether signal survives the trip through depth at all. Both can be ruined before any data is seen, and ruined in ways gradient descent cannot repair. Initialisation is the art of not losing the game on move zero.
The tempting "neutral" start, all weights zero, is the one provably broken choice. If every neuron in a layer has identical incoming weights (and, as under an all-zero start, identical outgoing weights), each computes the same output, therefore receives the same backpropagated gradient, therefore gets the same update. By induction the neurons stay identical forever: a layer of 4096 units carries exactly one unit's worth of features. Gradient descent preserves the symmetry; it cannot break it.
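The symmetry argument can be checked numerically. The sketch below is illustrative only: the toy data, the tiny tanh network, and the names `train_mlp` and `column_spread` are assumptions, not anything from the text. It uses a constant value of 0.5 rather than exactly zero so that the gradients are non-zero yet still identical across hidden units.

```python
import numpy as np

def train_mlp(W1, W2, X, y, lr=0.1, steps=200):
    """Plain gradient descent on a one-hidden-layer tanh network (squared error)."""
    W1, W2 = W1.copy(), W2.copy()
    for _ in range(steps):
        h = np.tanh(X @ W1)                 # hidden activations, shape (N, H)
        pred = h @ W2                       # predictions, shape (N, 1)
        err = (pred - y) / len(X)           # d(mean 0.5 * squared error) / d(pred)
        grad_W2 = h.T @ err
        grad_h = err @ W2.T
        grad_W1 = X.T @ (grad_h * (1.0 - h**2))   # tanh'(a) = 1 - tanh(a)^2
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2

def column_spread(W):
    """Largest gap between any hidden unit's incoming weights and unit 0's."""
    return float(np.max(np.abs(W - W[:, :1])))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = np.sin(X.sum(axis=1, keepdims=True))    # arbitrary toy regression target

H = 4  # hidden units
W1_const, W2_const = train_mlp(np.full((3, H), 0.5), np.full((H, 1), 0.5), X, y)
W1_rand, W2_rand = train_mlp(rng.normal(scale=0.5, size=(3, H)),
                             rng.normal(scale=0.5, size=(H, 1)), X, y)

print("constant init, spread between hidden units:", column_spread(W1_const))
print("random init,   spread between hidden units:", column_spread(W1_rand))
```

With the constant start the weights do move, but every hidden unit moves in lockstep, so the spread stays at zero up to floating-point rounding. The random start leaves the units free to diverge.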
A depth-n network multiplies the signal by n weight matrices. If each layer multiplies the typical activation magnitude by a factor c, depth delivers c^n. There is no gentle failure: c = 0.9 over 50 layers gives 0.9^50 ≈ 0.005 (activations and gradients vanish into numerical dust); c = 1.1 gives 1.1^50 ≈ 117 (they explode). The same geometric compounding applies to gradients on the way back. Only c ≈ 1 survives depth: wrong scale is not slow training, it is exponential decay or blow-up of the learning signal itself (the same product-of-factors arithmetic behind saturating activations).
Per-layer gain compounds geometrically with depth. The only line that survives 50 layers is the flat one.
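The compounding is easy to reproduce. As a rough sketch, the function below pushes unit-variance noise through a stack of random linear layers whose weight standard deviation is c/sqrt(width), so each layer scales the typical magnitude by about c. The function name `final_rms`, the width, and the batch size are arbitrary choices for illustration.

```python
import numpy as np

def final_rms(gain, depth=50, width=256, batch=128, seed=0):
    """RMS activation after `depth` random linear layers with weight std gain/sqrt(width)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))     # unit-variance input
    for _ in range(depth):
        W = rng.normal(scale=gain / np.sqrt(width), size=(width, width))
        x = x @ W                           # each layer scales magnitude by roughly `gain`
    return float(np.sqrt(np.mean(x**2)))

for c in (0.9, 1.0, 1.1):
    print(f"c = {c}: predicted c**50 = {c**50:.3g}, simulated RMS = {final_rms(c):.3g}")
```

The simulated values fluctuate around the prediction because the weights are random, but they track c**50 in order of magnitude: only the c = 1.0 stack ends anywhere near where it started.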
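So the design principle writes itself: choose the weight variance so each layer leaves activation variance unchanged. The whole derivation is one line. For a unit y = Σ_i w_i x_i with fan-in n, independent inputs, and independent zero-mean weights (taking the inputs as zero-mean as well):
Var(y) = n · Var(w) · Var(x), so Var(y) = Var(x) requires Var(w) = 1/n.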
That is Xavier/Glorot initialisation, the right answer when the nonlinearity behaves like the identity near zero (tanh does; the practical form averages fan-in and fan-out, Var(w) = 2/(n_in + n_out), to balance the backward pass too). He initialisation is the same argument with one correction: ReLU zeroes the negative half of a symmetric input, which halves its second moment and with it the variance passed on to the next layer. Compensate by doubling:
Var(w) = 2/n.
The lesson generalises beyond the two formulas: every activation function implies its own correction factor, computed by asking what the nonlinearity does to the variance of its input.
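One way to compute such a correction factor is by Monte Carlo. This is a sketch under stated assumptions: pre-activations are treated as standard normal, and the helper `variance_correction` is a hypothetical name. It estimates 1 / E[f(z)^2], the multiple of 1/n that the next layer's weight variance needs in order to keep unit-variance pre-activations.

```python
import numpy as np

def variance_correction(activation, samples=1_000_000, seed=0):
    """Estimate 1 / E[f(z)^2] for z ~ N(0, 1).

    With zero-mean weights, Var(next pre-activation) = n * Var(w) * E[f(z)^2],
    so Var(w) = correction / n keeps that variance at 1.
    """
    z = np.random.default_rng(seed).normal(size=samples)
    return float(1.0 / np.mean(activation(z) ** 2))

def identity(z):
    return z

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.2):               # slope 0.2 is an arbitrary example value
    return np.where(z > 0, z, slope * z)

for name, f in [("identity", identity), ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(f"{name}: correction factor ~ {variance_correction(f):.3f}")
```

The identity comes out near 1 (the Xavier case) and ReLU near 2 (the He case). A leaky ReLU with slope a lands in between, at 2/(1 + a^2), because the negative half is shrunk rather than removed.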
Two architectural inventions made networks far less brittle to all of this. Residual connections give the signal an identity path that bypasses the multiplicative gauntlet, so depth no longer compounds gain by default. Normalisation layers actively rescale activations every forward pass, correcting scale errors instead of letting them compound. Together they turned initialisation from a make-or-break setting into a sane default (He or Xavier, picked by activation) that rarely needs thought.
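The effect of normalisation on a badly scaled stack can be sketched in a few lines. Everything here is a simplification for illustration: `rms_norm` is a stripped-down stand-in for a real normalisation layer (no mean subtraction, no learned gain or bias), the layers are linear, and `stack_rms` is a hypothetical helper name.

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    """Rescale each row to unit RMS (simplified normalisation, no learned parameters)."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def stack_rms(gain, normalise, depth=50, width=256, batch=64, seed=0):
    """RMS activation after `depth` mis-scaled linear layers, with or without normalisation."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=gain / np.sqrt(width), size=(width, width))
        x = x @ W                       # per-layer gain of roughly `gain`
        if normalise:
            x = rms_norm(x)             # scale error is corrected before it can compound
    return float(np.sqrt(np.mean(x**2)))

for c in (0.9, 1.1):
    print(f"c = {c}: plain = {stack_rms(c, False):.3g}, normalised = {stack_rms(c, True):.3g}")
```

Without normalisation the 0.9 and 1.1 stacks vanish and explode as before. With it, both end at an RMS of 1 regardless of the per-layer gain, which is why the exact init scale matters much less in normalised networks.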
Yet it still earns a note. The early trajectory — which basin the optimiser drifts toward, how stable the first thousand steps are, whether large-LR warmup survives — is set by the init. Modern recipes still carry deliberate init choices: zero-init of the final layer of each residual block (start the network near the identity), depth-scaled inits in large transformers, and the small-but-not-too-small embedding scales that make warmup behave. The problem was demoted, not solved.
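The zero-init trick for residual blocks is simple enough to show directly. The sketch below assumes a two-layer ReLU branch and uses hypothetical helper names (`init_residual_block`, `residual_block`); real architectures differ in what the "final layer" of a block is.

```python
import numpy as np

def init_residual_block(width, hidden, rng):
    """He-initialise the first layer of the branch; zero-initialise the last."""
    W_in = rng.normal(scale=np.sqrt(2.0 / width), size=(width, hidden))
    W_out = np.zeros((hidden, width))       # branch contributes nothing at the start
    return W_in, W_out

def residual_block(x, W_in, W_out):
    """Identity path plus a two-layer ReLU branch."""
    return x + np.maximum(x @ W_in, 0.0) @ W_out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32))
blocks = [init_residual_block(32, 128, rng) for _ in range(50)]

out = x
for W_in, W_out in blocks:
    out = residual_block(out, W_in, W_out)

print(np.array_equal(out, x))               # prints True: 50 blocks act as the identity
```

Unlike an all-zero network, this start does not trap the units in symmetry: the hidden features produced by the random `W_in` differ from unit to unit, so each row of `W_out` receives a different gradient on the first step.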