PDF / PMF

Probability mass, probability density, and the trap between them

01 · The easy case · Discrete: probability sits on points

For a discrete variable, probability is straightforward bookkeeping. The probability mass function assigns each outcome a weight directly:

p(x) = P(X = x),    p(x) ≥ 0,    Σ_x p(x) = 1

Every number you read off a PMF is a genuine probability. A die: p(3) = 1/6, full stop. Nothing here will surprise you, which is exactly why the continuous case does.
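The bookkeeping can be sketched in a few lines. This is a minimal illustration, assuming a fair six-sided die; the names `pmf` and `prob` are made up for the example.

```python
from fractions import Fraction

# PMF of a fair six-sided die (assumed for illustration): each face gets weight 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """P(X in event): sum the PMF over the outcomes in the event."""
    return sum(pmf.get(x, Fraction(0)) for x in event)

total = sum(pmf.values())    # normalisation: the weights sum to 1
p_three = prob({3})          # a single outcome
p_even = prob({2, 4, 6})     # an event is just a set of outcomes

print(total, p_three, p_even)
```

Every value returned by `prob` is itself a probability, so no integration step is needed.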

02 · The trap · Continuous: P(X = x) = 0, for every x

Now let X be a person's height. What is the probability that someone is exactly 170 cm, to infinitely many decimal places? Zero. Any interval contains uncountably many real numbers, and at most countably many points can carry positive probability before the total exceeds 1. A continuous variable is one where no point carries any, so for every continuous X and every point x:

P(X = x) = 0   — yet X must land somewhere

The fix is to stop asking about points and start asking about intervals. The probability density function p(x) is defined so that probability is its integral:

P(a ≤ X ≤ b) = ∫_a^b p(x) dx,    ∫_{−∞}^{∞} p(x) dx = 1

A density is probability per unit length — the same relationship a material's density has to its mass. Asking "what is the probability at this point" is asking "what is the mass of this point": zero. Only the integral over a region means anything. The map analogy holds up: population density at one GPS coordinate is well defined, but the population of a coordinate is zero; you must integrate over an area to count people.
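To see the point-versus-interval distinction numerically, here is a rough sketch. It assumes heights follow a normal distribution with mean 170 cm and standard deviation 7 cm; those numbers are invented for illustration, and the function names are hypothetical.

```python
import math

MU, SIGMA = 170.0, 7.0  # assumed height model in cm, for illustration only

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """P(X <= x) for a normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu=MU, sigma=SIGMA):
    """Normal density: probability per cm, not a probability."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_prob(a, b):
    """P(a <= X <= b), the integral of the density over [a, b]."""
    return normal_cdf(b) - normal_cdf(a)

# Shrink an interval centred on 170 cm and watch the probability vanish.
widths = [10.0, 1.0, 0.1, 0.001]
probs = [interval_prob(170.0 - w / 2, 170.0 + w / 2) for w in widths]
for w, p in zip(widths, probs):
    print(f"width {w:>7} cm  P = {p:.6f}  P/width = {p / w:.6f}")
print("density at 170:", normal_pdf(170.0))
```

As the interval narrows, the probability goes to zero while the ratio probability/width settles on the density at 170. That ratio is what "probability per unit length" means.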

03 · Consequence · Densities can exceed 1

Because p(x) is a rate, not a probability, nothing caps it at 1. The uniform density on [0, 0.1] equals 10 everywhere on its support; a Gaussian with σ = 0.01 peaks near 40. The constraint is only that the area equals 1 — tall is fine if narrow.
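Both numbers above are easy to check. The sketch below evaluates the two densities and approximates their areas with a midpoint Riemann sum; `riemann_area` is a helper written for this example, not a library function.

```python
import math

def uniform_pdf(x, lo=0.0, hi=0.1):
    """Uniform density on [lo, hi]: height is 1 / width."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def gaussian_pdf(x, mu=0.0, sigma=0.01):
    """Gaussian density with a small standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def riemann_area(pdf, lo, hi, n=100_000):
    """Midpoint Riemann sum of pdf over [lo, hi]."""
    h = (hi - lo) / n
    return sum(pdf(lo + (i + 0.5) * h) for i in range(n)) * h

uniform_height = uniform_pdf(0.05)                      # height inside the support
gaussian_peak = gaussian_pdf(0.0)                       # height at the mean
uniform_area = riemann_area(uniform_pdf, 0.0, 0.1)
gaussian_area = riemann_area(gaussian_pdf, -0.1, 0.1)   # +/- 10 sigma

print(uniform_height, gaussian_peak)   # heights well above 1
print(uniform_area, gaussian_area)     # areas approximately 1
```

The heights are far above 1, yet both areas come out at approximately 1, which is the only constraint a density has to satisfy.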

[Figure: two panels. Left: "PMF · bars are probabilities". Right: "PDF · only area is probability", annotated "p(x) can pass 1" and "∫ₐᵇ p(x)dx".]

Left: each bar height is a probability. Right: heights are rates; the shaded area is the probability.

04 · Change of variables · Where the |det J| comes from

Transform Y = f(X) with f invertible and differentiable, and the naive guess p_Y(y) = p_X(f^{-1}(y)) is wrong: it forgets that f stretches and compresses space, and density is per unit length. Probability mass in a small interval must be conserved:

p_Y(y)·|dy| = p_X(x)·|dx|   (same mass, relabelled coordinates)
⇒ p_Y(y) = p_X(x)·|dx/dy|,   with x = f^{-1}(y)
⇒ p_Y(y) = p_X(f^{-1}(y))·|det J_{f^{-1}}(y)|   in ℝ^d

The Jacobian determinant is the local volume-change factor; dividing density by stretch keeps the area under the curve equal to 1. Normalising flows are this formula made into an architecture — stacks of invertible maps whose log|det J| is cheap to compute.
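A one-dimensional check makes the correction factor concrete. This sketch assumes X is standard normal and f(x) = exp(x), so f^{-1}(y) = ln y and |dx/dy| = 1/y; both choices are arbitrary and only serve the illustration.

```python
import math

def p_x(x):
    """Standard normal density (the assumed base distribution)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f_inv(y):
    """Inverse of the assumed map f(x) = exp(x)."""
    return math.log(y)

def naive_p_y(y):
    """Naive guess: plug f_inv(y) into p_x and forget the stretch factor."""
    return p_x(f_inv(y))

def correct_p_y(y):
    """Change of variables: multiply by |dx/dy| = 1/y for x = ln y."""
    return p_x(f_inv(y)) * abs(1.0 / y)

def midpoint_area(pdf, lo, hi, n=200_000):
    """Midpoint Riemann sum; midpoints never touch y = 0, so log is safe."""
    h = (hi - lo) / n
    return sum(pdf(lo + (i + 0.5) * h) for i in range(n)) * h

naive_area = midpoint_area(naive_p_y, 0.0, 200.0)
correct_area = midpoint_area(correct_p_y, 0.0, 200.0)

print("naive area:  ", naive_area)    # not 1, so not a density
print("correct area:", correct_area)  # approximately 1
```

The naive guess integrates to roughly e^{1/2} ≈ 1.65 rather than 1, so it is not a valid density. Multiplying by |dx/dy| restores the unit area.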

05 · The ML reading · Likelihood: the same function, read sideways

A model density p(x | θ) is one object read two ways. Fix θ and vary x: it is a distribution over data, and it integrates to 1. Fix the observed data x and vary θ: it is the likelihood L(θ) = p(x | θ) — a score for parameters, which integrates to nothing in particular over θ and is not a distribution over θ at all.
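The two readings can be compared numerically. The sketch below assumes an exponential model p(x | θ) = θ·exp(−θx) with rate θ, chosen only because both integrals have simple closed forms; the observation 0.5 and the rate 2.0 are made up.

```python
import math

def exp_pdf(x, rate):
    """Exponential density p(x | rate) = rate * exp(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def midpoint_area(fn, lo, hi, n=100_000):
    """Midpoint Riemann sum of fn over [lo, hi]."""
    h = (hi - lo) / n
    return sum(fn(lo + (i + 0.5) * h) for i in range(n)) * h

x_obs = 0.5        # a single made-up observation
rate_fixed = 2.0   # a made-up parameter value

# Reading 1: fix the parameter, integrate over data. This is a density.
area_over_x = midpoint_area(lambda x: exp_pdf(x, rate_fixed), 0.0, 50.0)

# Reading 2: fix the data, integrate over the parameter. This is the likelihood.
area_over_rate = midpoint_area(lambda r: exp_pdf(x_obs, r), 0.0, 100.0)

print("integral over x:   ", area_over_x)     # approximately 1
print("integral over rate:", area_over_rate)  # approximately 1 / x_obs**2 = 4
```

Over x the function integrates to 1. Over the rate it integrates to 1/x² = 4 for this observation, a number with no probabilistic meaning, which is why the likelihood is not a distribution over θ.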

Density · x varies, θ fixed
"Given these parameters, how is data distributed?" Normalised over x. Used for sampling and evaluation.
Likelihood · θ varies, x fixed
"Given this data, how plausible is each parameter?" Not normalised over θ. Used for fitting — see MLE vs MAP.

Because individual continuous datapoints have probability zero, "the probability of the data" always silently means the density evaluated at the data — which is why log-likelihoods of continuous models can be positive (densities above 1), a fact that confuses everyone exactly once.
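A positive log-likelihood is easy to produce. The sketch below scores three made-up observations under a Gaussian with σ = 0.01 and, for contrast, under one with σ = 1; `gaussian_logpdf` is a helper written for this example.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log of the Gaussian density at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

data = [0.001, -0.004, 0.007]  # made-up observations clustered near 0

# Log-likelihood = sum of log densities evaluated at the data.
ll_narrow = sum(gaussian_logpdf(x, 0.0, 0.01) for x in data)
ll_wide = sum(gaussian_logpdf(x, 0.0, 1.0) for x in data)

print("sigma = 0.01:", ll_narrow)  # positive: each density value exceeds 1
print("sigma = 1.0: ", ll_wide)    # negative: each density value is below 1
```

The sign carries no meaning on its own; only comparisons between log-likelihoods computed on the same data, in the same units, are informative.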

Mental Model