
Expectation

The long-run average, and why every loss function is one

01 · First principles: What number summarises a random quantity?

A random variable X takes different values with different probabilities. If you could keep only one number to stand for it, which one? The natural candidate: weight each value by its probability.

E[X] = Σ_x x·p(x)   (discrete)      E[X] = ∫ x·p(x) dx   (continuous)

The probability-weighted average is also the long-run value: play the game n times independently, average the outcomes, and as n grows the empirical average converges to E[X], provided E[X] exists (the law of large numbers). A casino does not know what your next roll pays; it knows, to within a small margin, what a million rolls pay per roll. Expectation is the casino's view of randomness.
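A minimal sketch of both readings of E[X], assuming a fair six-sided die as the game (the die, the seed, and the sample sizes are illustrative choices, not from the text): the exact probability-weighted sum, next to empirical averages for growing n.

```python
import random
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Probability-weighted average, computed exactly: sum of x * p(x).
expected = sum(x * p for x, p in pmf.items())

def empirical_average(n, rng):
    """Average of n simulated rolls."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed, arbitrary choice
averages = {n: empirical_average(n, rng) for n in (10, 1_000, 100_000)}

for n, avg in averages.items():
    print(f"n={n:>7}  average={avg:.4f}  E[X]={float(expected)}")
```

The small-n averages can land noticeably away from 3.5; the large-n one should sit very close to it.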

One caution at the door: the expectation is the centre of mass, not a typical outcome. A lottery ticket with E = −$0.50 never actually pays −$0.50.

02 · The superpower: Linearity, with no fine print

Almost every property in probability comes with conditions. Linearity of expectation asks only that the expectations exist:

E[aX + bY] = a·E[X] + b·E[Y]   — always, even when X and Y are dependent

This is the fact everyone forgets. Variances add only when the covariance is zero (independence guarantees that); expectations add unconditionally, because integration is linear and each term integrates the joint density down to its own marginal, so the dependence structure drops out. We can verify in two lines:

E[X + Y] = ∫∫ (x + y) p(x, y) dx dy
           = ∫∫ x·p(x,y) dx dy + ∫∫ y·p(x,y) dx dy = E[X] + E[Y]
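The same check can be run exactly on a small discrete case. This sketch assumes a deliberately extreme joint distribution, Y an exact copy of a fair coin X, so the two are as dependent as possible; the helper names `expect` and `var` are illustrative.

```python
from fractions import Fraction

# Joint pmf of (X, Y) where Y is a copy of X: perfectly dependent.
joint = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

def var(f):
    """Var[f(X, Y)] = E[f^2] - (E[f])^2."""
    return expect(lambda x, y: f(x, y) ** 2) - expect(f) ** 2

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_sum = expect(lambda x, y: x + y)

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
var_sum = var(lambda x, y: x + y)

print("E[X+Y] =", e_sum, " E[X]+E[Y] =", e_x + e_y)            # equal
print("Var(X+Y) =", var_sum, " Var(X)+Var(Y) =", var_x + var_y)  # not equal
```

Expectations add despite the dependence; the variances do not, because Cov(X, Y) is not zero here.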

The trick in practice: decompose a hard random quantity into a sum of easy indicator variables, take expectations one by one, and never once think about how the pieces interact. (Counting expected collisions in a hash table, expected triangles in a random graph — all the same move.)
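As an illustration of the indicator move, take the hash-table case, assuming m keys thrown uniformly at random into k buckets (the sizes, seed, and function name are hypothetical). Write the number of colliding pairs as a sum of one indicator per pair of keys; each indicator has expectation 1/k, so the total is C(m, 2)/k. The simulation is only a sanity check on that closed form.

```python
import random
from itertools import combinations
from math import comb

def colliding_pairs(m, k, rng):
    """Throw m keys into k buckets uniformly; count pairs sharing a bucket."""
    buckets = [rng.randrange(k) for _ in range(m)]
    return sum(1 for a, b in combinations(buckets, 2) if a == b)

m, k = 20, 50  # hypothetical sizes, chosen for illustration

# One indicator per pair of keys, each equal to 1 with probability 1/k.
closed_form = comb(m, 2) / k

rng = random.Random(1)  # fixed seed, arbitrary choice
trials = 10_000
simulated = sum(colliding_pairs(m, k, rng) for _ in range(trials)) / trials

print(f"closed form: {closed_form:.3f}   simulated: {simulated:.3f}")
```

At no point does the closed form need to know how the pairwise collisions interact with each other.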

What linearity does not give you: in general E[f(X)] ≠ f(E[X]) for nonlinear f. The gap is systematic (for convex f, Jensen's inequality gives E[f(X)] ≥ f(E[X])); see Jensen's inequality at work in KL divergence.
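A small exact example of the gap, assuming a fair die for X and f(x) = x² (both illustrative choices): the mean of the square exceeds the square of the mean, and for this particular f the difference is the variance.

```python
from fractions import Fraction

pmf = {face: Fraction(1, 6) for face in range(1, 7)}  # fair die

def expect(f):
    """E[f(X)] under the die's pmf."""
    return sum(f(x) * p for x, p in pmf.items())

def square(x):
    return x * x

mean_of_square = expect(square)               # E[X^2]
square_of_mean = square(expect(lambda x: x))  # (E[X])^2
gap = mean_of_square - square_of_mean         # equals Var(X) for f(x) = x^2

print(mean_of_square, square_of_mean, gap)
```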

03 · The workhorse: Every loss is an expectation

What we actually want to minimise in ML is the risk — expected loss over the true data distribution:

L(θ) = E_{(x,y)∼p_data}[ ℓ(f_θ(x), y) ]

We cannot compute this expectation (we do not have p_data), so we estimate it with samples:

L(θ) ≈ (1/n) Σ_{i=1}^{n} ℓ(f_θ(x_i), y_i)   (Monte Carlo; for fixed θ and i.i.d. samples, the LLN says this → L(θ))
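To see the estimate converge, here is a sketch on a toy problem where the true risk is known in closed form. Everything specific is an assumption made for illustration: x ~ Uniform(0, 1), y = 2x plus Gaussian noise, a fixed model f_θ(x) = θ·x with θ = 1.5, and squared-error loss.

```python
import random

THETA = 1.5  # hypothetical fixed model parameter: f_theta(x) = THETA * x
SIGMA = 0.1  # hypothetical noise standard deviation

def sample(rng):
    """Draw (x, y) from a toy p_data: x ~ Uniform(0, 1), y = 2x + noise."""
    x = rng.random()
    y = 2.0 * x + rng.gauss(0.0, SIGMA)
    return x, y

def loss(x, y):
    """Squared error of the fixed model on one example."""
    return (THETA * x - y) ** 2

def empirical_risk(n, rng):
    """Monte Carlo estimate of the risk from n i.i.d. samples."""
    return sum(loss(*sample(rng)) for _ in range(n)) / n

# Closed form for this toy setup: E[(0.5 x + eps)^2] = 0.25 * E[x^2] + SIGMA^2,
# with E[x^2] = 1/3 for Uniform(0, 1).
true_risk = 0.25 / 3 + SIGMA ** 2

rng = random.Random(0)  # fixed seed, arbitrary choice
estimates = {n: empirical_risk(n, rng) for n in (10, 1_000, 100_000)}

for n, est in estimates.items():
    print(f"n={n:>7}  empirical risk={est:.5f}  true risk={true_risk:.5f}")
```

In real training p_data is unknown, so the closed-form line has no counterpart; it is here only to show what the sample average is converging to.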

That is the entire justification for training on a dataset, in one line. A mini-batch gradient is the same move applied to a gradient: with examples sampled uniformly at random, it is an unbiased Monte Carlo estimate of ∇_θ E[ℓ]. SGD works because the expectation of the noisy gradient is the true gradient, which is linearity again, doing quiet load-bearing work.
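Unbiasedness can be checked exactly on a tiny case. The sketch below assumes a four-point dataset, a one-parameter model f_w(x) = w·x, and squared-error loss (all hypothetical), and measures unbiasedness against the full-dataset gradient: averaging the gradient over every equally likely size-2 mini-batch recovers it exactly, even though individual batches are off.

```python
from fractions import Fraction
from itertools import combinations

# Tiny hypothetical dataset of (x, y) pairs and a 1-D linear model f_w(x) = w * x.
data = [(1, 2), (2, 3), (3, 7), (4, 8)]
w = Fraction(1, 2)

def grad(batch):
    """Gradient of the mean squared error over a batch, with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

full_gradient = grad(data)

# Under uniform sampling without replacement every size-2 batch is equally
# likely, so the expected mini-batch gradient is the plain average over batches.
batches = list(combinations(data, 2))
batch_gradients = [grad(b) for b in batches]
expected_batch_gradient = sum(batch_gradients) / len(batches)

print("full gradient:          ", full_gradient)
print("expected batch gradient:", expected_batch_gradient)
print("individual batches:     ", [str(g) for g in batch_gradients])
```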

04 · Conditional expectation: The best possible prediction

Suppose you observe X and must predict Y with some function g(X), scored by mean squared error. Which g is optimal? Not a modelling choice — a theorem:

g*(x) = E[Y | X = x]   minimises   E[(Y − g(X))²]

The key step: fix x, write c = g(x), and expand around the conditional mean μ = E[Y|x]. The cross term 2(μ − c)·E[Y − μ | x] vanishes because E[Y − μ | x] = 0, leaving:

E[(Y − c)² | x] = E[(Y − μ)² | x] + (μ − c)²
           ⇒ minimised at c = μ, leaving the irreducible Var(Y|x)

So every regression model is an attempt to approximate E[Y|X], and the leftover Var(Y|x) is exactly the noise floor in the bias–variance decomposition. (Swap MSE for absolute error and the answer becomes the conditional median; the loss chooses the summary.)
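A brute-force check of both claims at a single fixed x, assuming a made-up conditional distribution of five equally likely Y values: scan constant predictions c on a grid and see which one each loss picks.

```python
from statistics import mean, median, pvariance

# Hypothetical conditional distribution of Y at one fixed x:
# five equally likely values, skewed by one large outcome.
ys = [0, 1, 2, 3, 14]

def mse(c):
    """Expected squared error of predicting the constant c."""
    return mean((y - c) ** 2 for y in ys)

def mae(c):
    """Expected absolute error of predicting the constant c."""
    return mean(abs(y - c) for y in ys)

# Candidate predictions c on a grid from 0 to 14 in steps of 0.25.
grid = [i / 4 for i in range(57)]

best_mse = min(grid, key=mse)  # lands on the mean of ys
best_mae = min(grid, key=mae)  # lands on the median of ys

print("MSE minimiser:", best_mse, " mean:  ", mean(ys))
print("MAE minimiser:", best_mae, " median:", median(ys))
print("MSE at the minimiser:", mse(best_mse), " Var:", pvariance(ys))
```

The squared-error minimiser is pulled toward the outlier while the absolute-error minimiser is not, and the squared error left over at the optimum is the variance.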

05 · Where intuition breaks: The mean is not the message

Two standard failure modes. First, for skewed distributions the mean can sit far from where most of the mass is: income, token frequencies, loss spikes. Second, some distributions have no expectation at all: for the Cauchy distribution the integral ∫ |x|·p(x) dx diverges, so E[X] is undefined, and sample averages never settle, no matter how many samples you take (the average of n standard Cauchy draws is itself standard Cauchy).
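One way to watch the second failure mode, sketched under illustrative choices of seed, trial count, and sample sizes: measure the spread (interquartile range) of the sample mean across many repeated experiments. For a standard normal the spread shrinks roughly like 1/√n; for a standard Cauchy it stays where it started.

```python
import math
import random
from statistics import quantiles

def cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def normal(rng):
    """Standard normal draw."""
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, n, trials, rng):
    """Interquartile range of the mean of n draws, across many trials."""
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1

rng = random.Random(0)  # fixed seed, arbitrary choice
trials = 2_000
normal_iqr = {n: iqr_of_sample_means(normal, n, trials, rng) for n in (1, 100)}
cauchy_iqr = {n: iqr_of_sample_means(cauchy, n, trials, rng) for n in (1, 100)}

for n in (1, 100):
    print(f"n={n:>3}  normal IQR={normal_iqr[n]:.3f}  Cauchy IQR={cauchy_iqr[n]:.3f}")
```

Averaging 100 Cauchy draws buys no more precision than looking at a single one.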

[Figure: right-skewed density over x, with the median and the mean E[X] marked; annotation: heavy tail pulls the mean.]

A right-skewed density. The mean is dragged toward the tail; most samples land left of it.

Practical reading: when you report an average metric, ask whether the distribution behind it is skewed or heavy-tailed. Mean latency and p99 latency are different facts about the same system.
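A concrete version of the latency remark, using a made-up list of 100 request latencies and a hand-written nearest-rank percentile (the data and the `percentile` helper are illustrative assumptions, not a standard API).

```python
from statistics import mean, median

# Hypothetical request latencies in milliseconds: mostly fast, a few very slow.
latencies_ms = [10] * 90 + [20] * 8 + [500, 2000]

def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer arithmetic
    return ordered[rank - 1]

avg = mean(latencies_ms)
mid = median(latencies_ms)
p99 = percentile(latencies_ms, 99)
share_below_mean = sum(v < avg for v in latencies_ms) / len(latencies_ms)

print(f"mean={avg} ms  median={mid} ms  p99={p99} ms")
print(f"share of requests faster than the mean: {share_below_mean:.0%}")
```

Here the mean describes almost no actual request: nearly all of them are faster, and the slow ones are far slower.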