Same normalisation, three choices of axis
Deep nets are compositions of many layers, so activation scale compounds with depth. A bad initialisation or a few large updates is enough for the distributions inside the network to drift, saturating activations and destabilising training. A norm layer re-standardises activations on every forward pass: subtract a mean, divide by a standard deviation, then restore expressiveness with a learned scale and shift.
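Written out, the recipe is two lines. The symbols are the conventional ones rather than anything fixed by the text above: γ for the learned scale, β for the learned shift, and a small constant ε (an assumption here, added so the division stays finite when the variance is near zero).

```latex
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}},
\qquad
y = \gamma \, \hat{x} + \beta
```

With γ = 1 and β = 0 the layer is pure standardisation. Learning γ and β lets the network choose a different scale and offset when that helps.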
The original story was "internal covariate shift": each layer's input distribution keeps moving, so each layer chases a moving target, and pinning the distribution fixes that. The story is intuitive and, as stated, mostly wrong: later work showed that BatchNorm still helps when the shift is deliberately re-injected. The measured benefit is geometric. Normalisation smooths the loss landscape (more predictable, more stable gradients; better effective Lipschitz constants), which permits much higher learning rates, faster convergence, and far less sensitivity to initialisation. Wrong story, right layer.
All three norms follow the same standardise-then-rescale recipe (RMSNorm simply drops the μ). The difference that matters is the axis along which the statistics are taken, and that one choice decides where each can be used.
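A minimal NumPy sketch of that claim, assuming a 2-D array laid out as (batch, features). The helper name `standardise` and the `eps` value are illustrative choices, and the learned scale and shift are left out to keep the axis in focus.

```python
import numpy as np

def standardise(x, axis, eps=1e-5):
    """Subtract the mean and divide by the standard deviation along `axis`."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # (batch, features)

# BatchNorm-style: one mean/variance per feature, taken across the examples.
batch_style = standardise(x, axis=0)

# LayerNorm-style: one mean/variance per example, taken across its features.
layer_style = standardise(x, axis=1)

print(batch_style.mean(axis=0))  # roughly zero for every feature column
print(layer_style.mean(axis=1))  # roughly zero for every example row
```

The function body is identical in both calls; only the `axis` argument changes.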
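BatchNorm normalises each feature (channel) using the mean and variance over the examples in the mini-batch. That choice has teeth: the output for one example depends on whichever batch it arrives in, so small batches give noisy statistics; training and inference behave differently, because inference falls back on running statistics; and a single token in a variable-length sequence has no well-defined batch statistics to use.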
Where the batch is large and examples are exchangeable, as in image classification with convolutional networks (CNNs), BatchNorm remains excellent, and its batch coupling even acts as a mild regulariser.
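To make the batch coupling concrete, here is a forward-only sketch of BatchNorm on a (batch, features) array. The class name, the `momentum` default, and the exponential-moving-average update are illustrative assumptions rather than any particular library's implementation; real frameworks differ in details such as using an unbiased variance for the running estimate.

```python
import numpy as np

class BatchNorm1d:
    """Minimal forward-only BatchNorm over a (batch, features) array."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)         # learned scale
        self.beta = np.zeros(num_features)         # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Statistics come from the current mini-batch, per feature.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Keep a moving average of the statistics for use at inference time.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference ignores the batch and uses the stored statistics.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1d(num_features=4)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))

train_out = bn(x, training=True)
eval_out = bn(x, training=False)
print(np.allclose(train_out, eval_out))  # False: same input, different statistics

# A batch of one has zero variance, so every feature collapses to beta (zeros here).
single = bn(x[:1], training=True)
print(single)
```

The same input gives different outputs in the two modes, and a batch of one carries no information through the layer at all.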
LayerNorm flips the axis: for each individual example (each token, in a transformer), normalise across its feature dimension. Every property that hurt BatchNorm inverts: no dependence on batch size, identical behaviour in training and inference, and perfectly defined for a single token in a sequence of any length. That batch-independence — not any subtlety — is why transformers use LayerNorm.
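The batch-independence can be checked directly. The sketch below places one token vector in two different batches and compares its normalised output under each scheme. Both helpers are illustrative and omit the learned scale and shift; `batch_norm_train` stands in for BatchNorm in training mode.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row across its own features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    """Normalise each feature column across the rows of the batch."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
token = rng.normal(size=(1, 6))

# The same token, surrounded by different batch-mates.
batch_a = np.concatenate([token, rng.normal(size=(3, 6))])
batch_b = np.concatenate([token, rng.normal(loc=5.0, size=(7, 6))])

ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
alone_same = np.allclose(layer_norm(token)[0], layer_norm(batch_a)[0])

print(ln_same, bn_same, alone_same)  # True False True
```

LayerNorm gives the token the same output in either batch, and on its own. The BatchNorm-style output moves with the company the token keeps.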
RMSNorm is LayerNorm minus the mean: divide by the root-mean-square of the features, keep the learned scale, and skip both the mean μ and the shift β.
One fewer reduction, fewer parameters, measurably faster. In practice it typically trains about as well as LayerNorm, which suggests the re-scaling was doing nearly all the work and the re-centring was ballast. Llama-family models and many other recent LLMs use RMSNorm.
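A short sketch of the relationship, assuming `eps` sits inside the square root (a common convention, not something the text above specifies). The function names and the `gamma`/`beta` parameters are illustrative.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Divide by the root-mean-square of the features; no mean, no shift."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def layer_norm(x, gamma, beta, eps=1e-6):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(3, 8))
gamma, beta = np.ones(8), np.zeros(8)

# On inputs that are already zero-mean, the RMS equals the standard deviation,
# so RMSNorm reproduces LayerNorm (with beta = 0).
centred = x - x.mean(axis=-1, keepdims=True)
print(np.allclose(rms_norm(centred, gamma), layer_norm(x, gamma, beta)))  # True

# On raw inputs with a non-zero mean the two differ: RMSNorm rescales but does not re-centre.
print(np.allclose(rms_norm(x, gamma), layer_norm(x, gamma, beta)))  # False
```

RMSNorm needs one reduction per row (the mean of squares) where LayerNorm needs two (the mean and the variance), and it has no β to store or update.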
The whole difference is the highlighted axis. BatchNorm couples examples vertically; LayerNorm (and RMSNorm) stays inside one row.
| Norm | Statistics over | Batch-dependent? | Train = test? | Lives in |
|---|---|---|---|---|
| BatchNorm | batch, per channel | yes | no (running stats) | convnets, large-batch vision |
| LayerNorm | features, per token | no | yes | transformers, RNNs |
| RMSNorm | features, per token, no mean | no | yes | modern LLMs (Llama et al.) |