How spread a variable is, and whether two variables move together
The expectation collapses a distribution to one point. Two distributions can share a mean and behave nothing alike — a thermostat at 20°C and a desert that averages 20°C. We want the typical distance from the mean. The naive candidate E[X − μ] is useless (it is identically zero; deviations cancel). So we square first:
Why squaring rather than the absolute value? Partly honesty — squared deviation is what shows up when MSE-style losses decompose — but mostly tractability: squares differentiate cleanly, expand algebraically, and make independent contributions add (section 04). The price is that variance lives in squared units and is sensitive to outliers; the standard deviation σ restores the original units.
The one identity worth memorising, three lines from linearity:
Read it as: spread = raw second moment minus what the mean already explains. Since variance cannot be negative, E[X²] ≥ E[X]² always — the simplest instance of Jensen's inequality, and the reason a squared average understates an average of squares.
Now ask the two-variable question: when X sits above its mean, does Y tend to sit above its mean too? Multiply the two deviations and average:
Positive when the deviations agree in sign, negative when they oppose, zero when they are unrelated linearly. Covariance has awkward units (X-units times Y-units), so we normalise by both spreads:
Variance is not linear; the cross-term is exactly where covariance enters:
When X and Y are independent the cross-term dies and variances add. This single fact runs a surprising amount of ML bookkeeping: the variance of a mean of n i.i.d. samples is σ²/n (why averaging works, why error bars shrink like 1/√n), and initialisation schemes such as Xavier and He are nothing but variance accounting — choose weight variance ∝ 1/fan-in so that the variance of activations neither explodes nor vanishes as it compounds through layers.
For a random vector x ∈ ℝd, collect every pairwise covariance into a matrix: Σij = Cov(xi, xj). Variances run down the diagonal; co-movements fill the rest. Geometrically, Σ describes the ellipsoid the data cloud fills — its eigenvectors are the axes of the ellipse, its eigenvalues the variances along each axis.
The covariance matrix drawn as geometry: eigenvectors give the ellipse axes, eigenvalues the spread along each.
This is the doorway to PCA, which simply rotates coordinates onto those eigenvectors so that Σ becomes diagonal — all co-movement converted into axis-aligned variance, ready to be ranked and truncated.