Reshaping the loss bowl before descending it
For linear-ish models the curvature of the loss is inherited directly from the inputs (for least squares, the Hessian is literally X^T X). Feed in one feature measured in metres and another in millimetres, or two features that are near-copies of each other, and the loss surface stops being a round bowl: it becomes a long, narrow, tilted valley. Gradient descent suffers in exactly this geometry, because the gradient points perpendicular to the contours, not at the minimum. The learning rate must be small enough for the steep direction, which condemns the shallow direction to crawl: the step count scales with the ratio of the largest to the smallest curvature (the condition number of the Hessian). The result is the famous zig-zag: bouncing between the valley walls while sliding slowly along the floor.
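The dependence on the condition number can be checked on the simplest possible case. The sketch below is an illustration, not a benchmark: it assumes a two-dimensional quadratic loss with Hessian diag(1, kappa), a step size of 1 / kappa (the value the steep direction dictates), and a hypothetical helper name gd_steps.

```python
import numpy as np

def gd_steps(kappa, tol=1e-6, max_steps=1_000_000):
    """Count gradient-descent steps on f(w) = 0.5 * w^T H w with H = diag(1, kappa)."""
    h = np.array([1.0, kappa])   # Hessian eigenvalues: shallow and steep directions
    lr = 1.0 / h.max()           # step size is limited by the steep direction
    w = np.array([1.0, 1.0])     # arbitrary starting point
    for step in range(1, max_steps + 1):
        w = w - lr * h * w       # the gradient of this quadratic is H w
        if np.linalg.norm(w) < tol:
            return step
    return max_steps

for kappa in (1.0, 10.0, 100.0):
    print(f"condition number {kappa:>5}: {gd_steps(kappa)} steps")
```

With this step size the steep coordinate is solved immediately, and all remaining steps are spent shrinking the shallow coordinate by a factor of (1 - 1 / kappa) each time, so the count grows roughly in proportion to kappa.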
Figure: same problem, two coordinate systems. On round contours the negative gradient aims straight at the optimum from anywhere.
The fix has two stages of increasing ambition. Standardisation handles scale only: per feature, subtract the mean, divide by the standard deviation. Every feature now has variance one. This fixes the metres-versus-millimetres axis stretching and is, in practice, the 90% solution — cheap, robust, and the default preprocessing for anything gradient-trained.
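A minimal sketch of standardisation and its effect on conditioning, assuming NumPy and synthetic data. The function name standardise, the eps guard against constant features, and the metres and millimetres columns are all illustrative choices, not part of the original note.

```python
import numpy as np

def standardise(X, eps=1e-12):
    """Per feature: subtract the mean, divide by the standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)   # eps avoids division by zero for constant features

# Hypothetical data: two independent features, one in metres, one in millimetres.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(1.7, 0.1, size=500),       # metres
    rng.normal(1700.0, 100.0, size=500),  # millimetres
])

# Condition number of the covariance, a stand-in for the curvature the inputs induce.
cond_raw = np.linalg.cond(np.cov(X, rowvar=False))
cond_std = np.linalg.cond(np.cov(standardise(X), rowvar=False))
print(f"condition number before: {cond_raw:.3g}, after: {cond_std:.3g}")
```

In real pipelines the mean and standard deviation would be computed on the training split only and reused for validation and test data.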
But standardisation treats features independently, so it cannot fix the tilt: two correlated features still produce diagonal valleys. Whitening goes further and removes the correlations too, using the full covariance matrix Σ (with μ the feature mean):

x_white = Σ^(-1/2) (x - μ),    Cov(x_white) = I
Identity covariance: every direction has unit variance, no direction is correlated with any other. The loss bowl, to the extent the inputs control it, becomes round, as in the right panel above. (This is also the sense in which whitening is a poor man's Newton's method: instead of preconditioning the update by H^(-1), precondition the data so H is closer to I.)
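The identity-covariance claim and the Newton analogy follow in a few lines. The derivation below is a sketch under stated assumptions: the whitened variable is z = Σ^(-1/2)(x - μ) with the symmetric inverse square root, Σ is estimated from the n centred rows X_c with a 1/n normalisation, and Z denotes the matrix of whitened rows.

```latex
\begin{aligned}
z &= \Sigma^{-1/2}(x - \mu), \\
\operatorname{Cov}(z) &= \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I, \\
Z^\top Z &= \Sigma^{-1/2}\,X_c^\top X_c\,\Sigma^{-1/2}
          = n\,\Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = n I .
\end{aligned}
```

The last line is the least-squares Hessian on whitened inputs: a multiple of the identity, so its condition number is 1 and a single well-chosen learning rate suits every direction.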
Σ^(-1/2) is not unique: any rotation of a whitened variable is still white, so there is a family of valid transforms. Two members matter. With eigendecomposition Σ = U Λ U^T:

W_PCA = Λ^(-1/2) U^T,    W_ZCA = U Λ^(-1/2) U^T
PCA whitening rotates into the eigenbasis and rescales: the output axes are principal components, ordered by variance, convenient for dropping low-variance dimensions while you are there. ZCA whitening applies the extra rotation U to come back: among all whitening transforms it is the one closest to the identity, so the output still looks like the input (a ZCA-whitened image is still recognisably the image, sharpened; a PCA-whitened one is scrambled coefficients). Use the PCA flavour for dimensionality-reduction pipelines, and ZCA when downstream processing expects data in the original coordinates.
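Both flavours can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the covariance is estimated from the data at hand; the function name whitening_matrices, the eps regulariser for tiny eigenvalues, and the synthetic correlated data are assumptions made for the example.

```python
import numpy as np

def whitening_matrices(X, eps=1e-8):
    """Return (W_pca, W_zca) estimated from the rows of X."""
    Xc = X - X.mean(axis=0)
    sigma = np.cov(Xc, rowvar=False)
    eigvals, U = np.linalg.eigh(sigma)       # symmetric eigendecomposition, ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so components run by decreasing variance
    eigvals, U = eigvals[order], U[:, order]
    scale = 1.0 / np.sqrt(eigvals + eps)     # eps guards against near-zero eigenvalues
    W_pca = scale[:, None] * U.T             # Lambda^(-1/2) U^T: rotate, then rescale
    W_zca = U @ W_pca                        # U Lambda^(-1/2) U^T: rotate back afterwards
    return W_pca, W_zca

# Hypothetical data: two strongly correlated features.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=2000)
Xc = X - X.mean(axis=0)

W_pca, W_zca = whitening_matrices(X)
Z_pca = Xc @ W_pca.T
Z_zca = Xc @ W_zca.T
print(np.round(np.cov(Z_pca, rowvar=False), 3))
print(np.round(np.cov(Z_zca, rowvar=False), 3))
```

Dropping low-variance dimensions in the PCA flavour amounts to keeping only the first k rows of W_pca. The ZCA matrix is symmetric, which is one way to see that it adds no net rotation.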
That last limitation explains the modern resolution. Instead of whitening the inputs once, normalisation layers re-standardise activations at every layer, on every forward pass: a diagonal (no decorrelation), running-statistics, online approximation of the whitening described in this note, applied where the conditioning problem actually lives. Full whitening survives where it is cheap and the data is low-dimensional: classical pipelines, PCA preprocessing, and as the conceptual ancestor of every trick, from Adam's per-parameter scaling to norm layers, that makes the loss surface rounder instead of making the optimiser smarter.
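To make the "diagonal, running-statistics" description concrete, here is a stripped-down sketch loosely in the spirit of batch normalisation. It is not any framework's implementation: the class name RunningStandardiser, the momentum and eps defaults, and the omission of the learnable scale and shift are all simplifying assumptions.

```python
import numpy as np

class RunningStandardiser:
    """Per-feature standardisation of activations with running statistics."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, activations, training=True):
        if training:
            # Use the current batch, and fold its statistics into the running estimates.
            mean = activations.mean(axis=0)
            var = activations.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At inference time, rely on the accumulated running estimates.
            mean, var = self.running_mean, self.running_var
        # Diagonal only: each feature is rescaled on its own, with no decorrelation.
        return (activations - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
acts = rng.normal(5.0, 3.0, size=(256, 3))   # hypothetical layer activations
layer = RunningStandardiser(num_features=3)
out = layer(acts, training=True)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

Because only per-feature means and variances are tracked, the cost is linear in the layer width, whereas full whitening would need a covariance matrix and its eigendecomposition at every layer.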