Data Whitening

Reshaping the loss bowl before descending it

01 · First principles: Why input geometry is the optimiser's problem

For linear-ish models the curvature of the loss is inherited directly from the inputs (for least squares, the Hessian is literally X^TX). Feed in one feature measured in metres and another in millimetres, or two features that are near-copies of each other, and the loss surface stops being a round bowl: it becomes a long, narrow, tilted valley. Gradient descent suffers in exactly this geometry, because the gradient points perpendicular to the contours, not at the minimum. The learning rate must be small enough for the steep direction, which condemns the shallow direction to crawl — the step count scales with the ratio between them (the condition number). The result is the famous zig-zag: bouncing between the valley walls while sliding slowly along the floor.

Same problem, two coordinate systems. On round contours the negative gradient aims straight at the optimum from anywhere.
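The cost of a bad condition number can be checked on a toy quadratic. The sketch below is an illustration, not part of the original note: the function name `gd_steps`, the diagonal Hessians, the tolerance, and the step size 1/λ_max are all assumptions chosen for simplicity. With that step size the steep direction is solved immediately, so the run isolates the slow crawl along the shallow direction; the zig-zag itself shows up at larger step sizes.

```python
import numpy as np

def gd_steps(hessian, w0, tol=1e-6, max_steps=100_000):
    """Count gradient-descent steps on f(w) = 0.5 * w^T H w until ||w|| < tol."""
    # Assumed step size: 1 / (largest eigenvalue), a safe fixed rate that is
    # dictated entirely by the steepest direction.
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.asarray(w0, dtype=float)
    for step in range(max_steps):
        if np.linalg.norm(w) < tol:
            return step
        w = w - lr * (hessian @ w)  # the gradient of this quadratic is H w
    return max_steps

w0 = [1.0, 1.0]
round_bowl = np.diag([1.0, 1.0])        # condition number 1
narrow_valley = np.diag([1.0, 100.0])   # condition number 100

steps_round = gd_steps(round_bowl, w0)
steps_narrow = gd_steps(narrow_valley, w0)
print(steps_round, steps_narrow)
```

On the round bowl a single step lands on the optimum; on the narrow valley the same rule needs on the order of a thousand steps, because the shallow coordinate shrinks by only 1% per step.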

02 · Two operations: Standardise, then decorrelate

The fix has two stages of increasing ambition. Standardisation handles scale only: per feature, subtract the mean, divide by the standard deviation. Every feature now has variance one. This fixes the metres-versus-millimetres axis stretching and is, in practice, the 90% solution — cheap, robust, and the default preprocessing for anything gradient-trained.
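A minimal NumPy sketch of standardisation follows. The helper names `fit_standardiser` and `standardise`, the small `eps` floor that guards against constant features, and the synthetic metres/millimetres data are assumptions made for illustration.

```python
import numpy as np

def fit_standardiser(X_train, eps=1e-12):
    """Return per-feature mean and standard deviation from the training data."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    # Floor the scale so a constant feature does not cause division by zero.
    return mu, np.maximum(sigma, eps)

def standardise(X, mu, sigma):
    """Apply previously fitted statistics: subtract the mean, divide by the std."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Hypothetical data: one feature in metres, one in millimetres.
X = np.column_stack([
    rng.normal(2.0, 0.5, size=1000),
    rng.normal(2000.0, 500.0, size=1000),
])

mu, sigma = fit_standardiser(X)
Z = standardise(X, mu, sigma)
print(Z.mean(axis=0), Z.std(axis=0))
```

Keeping the fit and the transform separate makes it straightforward to reuse the training statistics on validation or test data.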

But standardisation treats features independently, so it cannot fix the tilt: two correlated features still produce diagonal valleys. Whitening goes further and removes the correlations too, using the full covariance matrix Σ:

x′ = Σ^−1/2 (x − μ) ⇒ Cov(x′) = Σ^−1/2 Σ Σ^−1/2 = I

Identity covariance: every direction has unit variance, no direction is correlated with any other. The loss bowl, to the extent the inputs control it, becomes round — the right panel above. (This is also the sense in which whitening is a poor man's Newton's method: instead of preconditioning the update by H⁻¹, precondition the data so H is closer to I.)
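The difference between the two stages can be seen numerically. The sketch below uses synthetic, strongly correlated data (an assumption) and computes Σ^−1/2 through an eigendecomposition, which is one common way to obtain it. Since the least-squares Hessian is proportional to the input covariance once the data is centred, the condition number of the covariance stands in for the shape of the loss bowl.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two features that are near-copies of each other.
latent = rng.normal(size=(5000, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(5000, 1)),
    latent + 0.1 * rng.normal(size=(5000, 1)),
])

mu = X.mean(axis=0)
Xc = X - mu
cov = np.cov(X, rowvar=False)

# Stage 1: standardisation fixes the scale but keeps the correlation.
Z = Xc / X.std(axis=0, ddof=1)
cov_std = np.cov(Z, rowvar=False)

# Stage 2: whitening with the symmetric inverse square root of the covariance.
eigvals, U = np.linalg.eigh(cov)
inv_sqrt = U @ np.diag(eigvals ** -0.5) @ U.T
W = Xc @ inv_sqrt  # inv_sqrt is symmetric, so this applies it to every row
cov_white = np.cov(W, rowvar=False)

cond_std = np.linalg.cond(cov_std)
cond_white = np.linalg.cond(cov_white)
print(cond_std, cond_white)
```

After standardisation the off-diagonal entry of the covariance is still close to one, so the condition number stays large; after whitening the covariance is the identity up to round-off.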

03 · Flavours: PCA vs ZCA whitening

Σ^−1/2 is not unique — any rotation of a whitened variable is still white, so there is a family of valid transforms. Two members matter. With eigendecomposition Σ = UΛU^T:

PCA: x′ = Λ^−1/2 U^T (x − μ)
ZCA: x′ = U Λ^−1/2 U^T (x − μ)

PCA whitening rotates into the eigenbasis and rescales: the output axes are principal components, ordered by the variance each carried before rescaling, which is convenient for dropping low-variance dimensions while you are there. ZCA whitening applies the extra rotation U to come back: among all whitening transforms it is the one that changes the data least (closest to the identity in the least-squares sense), so the output still looks like the input (a ZCA-whitened image is still recognisably the image, sharpened; a PCA-whitened one is scrambled coefficients). Use the PCA flavour for dimensionality-reduction pipelines, and ZCA when downstream processing expects data in the original coordinates.
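Both flavours can be built from one eigendecomposition. In the sketch below the function name `whitening_matrices`, the mixing matrix `A`, and the small `eps` added to the eigenvalues for numerical safety are assumptions for illustration.

```python
import numpy as np

def whitening_matrices(X, eps=1e-8):
    """Return (mu, W_pca, W_zca) so that x' = W @ (x - mu) is whitened."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, U = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; reorder so component 0 has the most variance.
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    scale = np.diag(1.0 / np.sqrt(eigvals + eps))
    W_pca = scale @ U.T   # rotate into the eigenbasis, then rescale
    W_zca = U @ W_pca     # rotate back into the original coordinates
    return mu, W_pca, W_zca

rng = np.random.default_rng(1)
# Hypothetical correlated data: white noise pushed through a fixed mixing matrix.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 0.3]])
X = rng.normal(size=(2000, 3)) @ A.T

mu, W_pca, W_zca = whitening_matrices(X)
Xc = X - mu
X_pca = Xc @ W_pca.T
X_zca = Xc @ W_zca.T

# Mean squared distance between the whitened data and the centred original.
dist_pca = np.mean(np.sum((X_pca - Xc) ** 2, axis=1))
dist_zca = np.mean(np.sum((X_zca - Xc) ** 2, axis=1))
print(dist_pca, dist_zca)
```

Both outputs have identity covariance; the ZCA matrix is symmetric and leaves the data closer to where it started. For PCA-based truncation, keeping only the first k rows of `W_pca` retains the k highest-variance components.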

04 · The bill: What whitening costs

Estimating Σ is the hard part. A d×d covariance has ~d²/2 free entries; with few samples or large d the estimate is noisy or outright singular (it is rank-deficient whenever there are fewer samples than features), and Σ^−1/2 amplifies the directions with the smallest (worst-estimated) eigenvalues. In practice a small ε is added before inverting, (Λ+εI)^−1/2, and that ε is doing load-bearing work. Shrinkage estimators exist for exactly this.
Outliers poison it twice: once in μ and Σ (both are means, maximally outlier-sensitive), then again when the transform stretches the directions the outlier inflated.
It is fitted preprocessing, so it must be fit on training data only and refit inside each CV fold; whitening with statistics from the full dataset is textbook leakage (see: cross-validation, cardinal sins).
And it only straightens the geometry the inputs create; a deep nonlinear network manufactures fresh ill-conditioning internally, which input whitening cannot reach.
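Two of the costs above, the load-bearing ε and fitting inside each fold, can be sketched together. The class `ZCAWhitener`, the value `eps=1e-2`, the manual five-fold split, and the random data with fewer samples than features are all assumptions for illustration, not a reference implementation.

```python
import numpy as np

class ZCAWhitener:
    """Hypothetical helper: ZCA whitening with an eps floor on the eigenvalues."""

    def __init__(self, eps=1e-2):
        self.eps = eps

    def fit(self, X):
        self.mu_ = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, U = np.linalg.eigh(cov)
        eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
        # eps caps the amplification of any direction at 1 / sqrt(eps).
        self.W_ = U @ np.diag(1.0 / np.sqrt(eigvals + self.eps)) @ U.T
        return self

    def transform(self, X):
        return (X - self.mu_) @ self.W_  # W_ is symmetric

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 50))  # fewer samples than features: singular covariance

folds = np.array_split(rng.permutation(len(X)), 5)
largest_gain = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit on the training part of this fold only, then apply to both parts.
    whitener = ZCAWhitener(eps=1e-2).fit(X[train_idx])
    X_train_w = whitener.transform(X[train_idx])
    X_val_w = whitener.transform(X[val_idx])
    largest_gain.append(np.linalg.norm(whitener.W_, 2))

print(max(largest_gain))
```

Without the ε term the zero eigenvalues of the singular covariance would make the transform blow up; with it, no direction is stretched by more than 1/√ε.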

05 · Modern note: The cheap online version won

That last limitation explains the modern resolution. Instead of whitening the inputs once, normalisation layers re-standardise activations at every layer, on every forward pass: a diagonal (scale only, no decorrelation), online approximation of the idea in this note, computed from the statistics of the current batch or example (or running averages of them) and applied where the conditioning problem actually lives. Full whitening survives where it is cheap and the data is low-dimensional (classical pipelines, PCA preprocessing), and as the conceptual ancestor of every trick, from Adam's per-parameter scaling to norm layers, that makes the loss surface rounder instead of making the optimiser smarter.
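The "diagonal, online" character of a normalisation layer can be sketched in a few lines. This is a simplified forward pass in the style of batch normalisation, not any library's implementation: the function name `norm_layer_forward`, the `momentum` and `eps` values, and the synthetic activations are assumptions, and the learnable scale and shift that real layers add are omitted.

```python
import numpy as np

def norm_layer_forward(h, running_mean, running_var,
                       momentum=0.1, eps=1e-5, training=True):
    """Standardise each activation feature independently (no decorrelation)."""
    if training:
        mean = h.mean(axis=0)
        var = h.var(axis=0)
        # Online part: exponential moving averages for use at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mean, var = running_mean, running_var
    h_hat = (h - mean) / np.sqrt(var + eps)
    return h_hat, running_mean, running_var

rng = np.random.default_rng(3)
# Hypothetical activations: two correlated units with different scales.
latent = rng.normal(size=(256, 1))
h = np.hstack([3.0 * latent + 0.2 * rng.normal(size=(256, 1)),
               0.5 * latent + 0.05 * rng.normal(size=(256, 1))])

h_hat, run_mean, run_var = norm_layer_forward(h, np.zeros(2), np.ones(2))
corr_after = np.corrcoef(h_hat, rowvar=False)[0, 1]
print(h_hat.var(axis=0), corr_after)
```

Each unit comes out with roughly unit variance, but the correlation between the two units is untouched, which is exactly the part of full whitening that these layers skip.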

Mental Model

Bad input geometry (scales, correlations) becomes bad loss geometry; gradient descent zig-zags across narrow tilted valleys.
Standardisation fixes scale per feature; whitening also rotates away correlation: x → Σ^−1/2(x−μ), covariance = I.
PCA whitening lands in component space (good for truncation); ZCA returns to original coordinates (data still looks like itself).
The costs: estimating Σ in high dimension, outlier sensitivity, and strict fit-inside-the-fold hygiene.
Norm layers are the cheap online descendant: re-standardising every layer beat whitening the inputs once.