General ML

Variance and Covariance

How spread a variable is, and whether two variables move together

01 · First principlesThe mean says where; what says how wide?

The expectation collapses a distribution to one point. Two distributions can share a mean and behave nothing alike — a thermostat at 20°C and a desert that averages 20°C. We want the typical distance from the mean. The naive candidate E[X − μ] is useless (it is identically zero; deviations cancel). So we square first:

Var(X) = E[(X − μ)²], μ = E[X] σ = √Var(X)

Why squaring rather than the absolute value? Partly honesty — squared deviation is what shows up when MSE-style losses decompose — but mostly tractability: squares differentiate cleanly, expand algebraically, and make independent contributions add (section 04). The price is that variance lives in squared units and is sensitive to outliers; the standard deviation σ restores the original units.

02 · The identityVar(X) = E[X²] − E[X]²

The one identity worth memorising, three lines from linearity:

Var(X) = E[(X − μ)²] = E[X² − 2μX + μ²]
= E[X²] − 2μ·E[X] + μ² (linearity; μ is a constant)
= E[X²] − E[X]²

Read it as: spread = raw second moment minus what the mean already explains. Since variance cannot be negative, E[X²] ≥ E[X]² always — the simplest instance of Jensen's inequality, and the reason a squared average understates an average of squares.

03 · Two variablesCovariance and correlation

Now ask the two-variable question: when X sits above its mean, does Y tend to sit above its mean too? Multiply the two deviations and average:

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − E[X]E[Y]

Positive when the deviations agree in sign, negative when they oppose, zero when they are unrelated linearly. Covariance has awkward units (X-units times Y-units), so we normalise by both spreads:

ρ = Cov(X, Y) / (σ_X σ_Y) ∈ [−1, 1]

Two standing warnings. Independence ⇒ Cov = 0, but Cov = 0 does not imply independence (Y = X² with symmetric X has zero covariance and total dependence). And correlation measures co-movement, not causation — it cannot tell you which way the arrow points, or whether there is one.

04 · The algebraVariance of a sum

Variance is not linear; the cross-term is exactly where covariance enters:

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y)

When X and Y are independent the cross-term dies and variances add. This single fact runs a surprising amount of ML bookkeeping: the variance of a mean of n i.i.d. samples is σ²/n (why averaging works, why error bars shrink like 1/√n), and initialisation schemes such as Xavier and He are nothing but variance accounting — choose weight variance ∝ 1/fan-in so that the variance of activations neither explodes nor vanishes as it compounds through layers.

05 · Many variablesThe covariance matrix is the shape of the cloud

For a random vector x ∈ ℝ^d, collect every pairwise covariance into a matrix: Σ_ij = Cov(x_i, x_j). Variances run down the diagonal; co-movements fill the rest. Geometrically, Σ describes the ellipsoid the data cloud fills — its eigenvectors are the axes of the ellipse, its eigenvalues the variances along each axis.

The covariance matrix drawn as geometry: eigenvectors give the ellipse axes, eigenvalues the spread along each.

This is the doorway to PCA, which simply rotates coordinates onto those eigenvectors so that Σ becomes diagonal — all co-movement converted into axis-aligned variance, ready to be ranked and truncated.

Mental Model

Variance = expected squared distance from the mean; σ restores the units.
Var = E[X²] − E[X]²: second moment minus what the mean explains.
Covariance asks one question: do deviations agree in sign? Correlation is the same answer, normalised to [−1, 1].
Variances add only when the cross-term dies — independence is what makes σ²/n averaging and init schemes work.
Σ is the shape of the data cloud; PCA is the rotation that makes it axis-aligned.