Generative Modelling

The Score Function

The direction uphill, without ever knowing the height

01 · First principles

The constant that kills density modelling

Write any density as an energy model: p_θ(x) = exp(−E_θ(x)) / Z_θ, where Z_θ = ∫ exp(−E_θ(x)) dx normalises it. The energy is just a neural network: easy. Z_θ is an integral over all of image space: hopeless. And maximum likelihood needs Z_θ, because making the data probable means making everything else improbable, and you cannot budget probability without knowing the total.

Now take the gradient of the log-density with respect to x, not θ:

∇_x log p_θ(x) = −∇_x E_θ(x) − ∇_x log Z_θ = −∇_x E_θ(x), because ∇_x log Z_θ = 0 (Z_θ does not depend on x)

The normalising constant vanishes. This vector field, s(x) = ∇_x log p(x), is the score: at every point it points toward higher data density, with magnitude proportional to how steeply the log-density rises. It encodes the entire shape of the distribution while being completely blind to Z.
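The cancellation can be checked numerically. The sketch below is illustrative only: the 1-D double-well energy, the grid used to approximate Z, and the finite-difference `grad` helper are all assumptions made for the example, not part of any particular model.

```python
import numpy as np

def energy(x):
    # Hypothetical 1-D double-well energy, chosen only for illustration.
    return 0.25 * x**4 - x**2

def log_unnormalised(x):
    # log of exp(-E(x)); no normalising constant involved.
    return -energy(x)

def grad(f, x, h=1e-5):
    # Central finite difference; stands in for autodiff in this sketch.
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Approximate Z on a grid. This is feasible only because the example is 1-D.
grid = np.linspace(-6.0, 6.0, 20001)
dx = grid[1] - grid[0]
Z = np.sum(np.exp(log_unnormalised(grid))) * dx

def log_normalised(x):
    # Properly normalised log-density: log p(x) = -E(x) - log Z.
    return log_unnormalised(x) - np.log(Z)

xs = np.array([-2.0, -0.5, 0.3, 1.7])
score_without_Z = grad(log_unnormalised, xs)
score_with_Z = grad(log_normalised, xs)
analytic_score = -(xs**3 - 2.0 * xs)  # -dE/dx, computed by hand

print(score_without_Z)
print(score_with_Z)
print(analytic_score)
```

All three arrays should agree up to finite-difference error: subtracting log Z shifts the log-density by a constant and leaves its gradient untouched.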

The pivot: if "how much probability is there in total" (Z) is intractable but "which way is more probable" (the score) is free of Z, then build the whole generative model out of directions.

02 · Visualize it

A field of arrows

The score field ∇_x log p(x): every arrow points in the direction in which the log-density rises fastest, toward nearby regions of high density. Grey contours are level sets of p; the field knows their shape but not their absolute height.

03 · How it breaks

Learning the score, and where naive matching fails

Train a network s_θ(x) to equal the true score by minimising E_p[‖s_θ(x) − ∇_x log p(x)‖²]. The target contains the unknown ∇_x log p, so the objective is circular. Score matching (Hyvärinen) fixes the circularity with integration by parts, leaving an objective in s_θ alone, E_p[tr(∇_x s_θ(x)) + ½‖s_θ(x)‖²], but it requires the trace of the Jacobian ∇_x s_θ, which is expensive in high dimension.
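A minimal 1-D sketch of the integration-by-parts objective, E_p[tr(∇_x s_θ) + ½‖s_θ‖²], may help make it concrete. The Gaussian toy data and the linear score model s(x) = −a (x − b) are assumptions chosen so that the minimiser has a closed form; in 1-D the "Jacobian trace" is just the derivative ds/dx = −a.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data, assumed Gaussian with mean 2.0 and std 1.5 for this example.
x = rng.normal(loc=2.0, scale=1.5, size=50_000)

def ism_loss(a, b, x):
    # Linear score model s(x) = -a * (x - b).
    s = -a * (x - b)
    trace_jac = -a  # ds/dx; the 1-D stand-in for tr(Jacobian)
    # Sample estimate of E[ tr(ds/dx) + 0.5 * s(x)^2 ]; no true score needed.
    return np.mean(trace_jac + 0.5 * s**2)

# Setting the gradient of the loss to zero gives the minimiser in closed form.
b_hat = x.mean()
a_hat = 1.0 / np.mean((x - b_hat) ** 2)

best = ism_loss(a_hat, b_hat, x)
perturbed = [
    ism_loss(a_hat * 1.1, b_hat, x),
    ism_loss(a_hat, b_hat + 0.2, x),
]

print("fitted a, b:", a_hat, b_hat)
print("true   a, b:", 1.0 / 1.5**2, 2.0)
print("loss at fit vs perturbed:", best, perturbed)
```

The fitted parameters should land near the true score of the data, −(x − 2.0) / 1.5², even though the loss never evaluates that score. In this toy case the trace is free; for a network on images it means differentiating every output with respect to every input, which is the cost referred to above.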

There is a worse problem. Real data sits near a thin manifold: almost everywhere in pixel space, p(x) ≈ 0, the training expectation puts no weight there, and the learned score is garbage exactly where a sampler starting from random noise begins its journey. A perfect score on the manifold and nonsense everywhere else is a map of the destination with no roads to it.

Denoising score matching (Vincent) solves both at once. Blur the data with Gaussian noise, x̃ = x + σε, and learn the score of the noised distribution p_σ. A short calculation (closely related to Tweedie's formula for the posterior mean) shows the conditional score has a closed form:

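∇_x̃ log p_σ(x̃|x) = (x − x̃) / σ² = −ε / σ

(the conditional score is the noise you added, negated and scaled by 1/σ)

and that regressing on this conditional target recovers the marginal score: the minimiser of the objective below is ∇ log p_σ, because the conditional targets average to it at every x̃: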

min_θ E_{x, ε}[ ‖ s_θ(x + σε) + ε/σ ‖² ] ⇒ s_θ ≈ ∇ log p_σ

No Jacobians, no Z, and the noise itself spreads training data over the off-manifold regions that need coverage. The cost: we learn the score of a smoothed distribution, not of p itself. Diffusion will turn that cost into the design.
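The denoising objective can be sketched in a few lines. Everything specific here is an assumption for illustration: 1-D Gaussian toy data, a single noise level, and a linear score model s_θ(x) = w·x + c fitted by ordinary least squares, which minimises the squared loss exactly for that model class. A real model would be a network trained by stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s, sigma = 1.0, 0.5, 0.8           # toy data mean/std and noise level (assumed)

x = rng.normal(mu, s, size=200_000)    # clean data
eps = rng.standard_normal(x.shape)     # the noise we add
x_noisy = x + sigma * eps
target = -eps / sigma                  # conditional score of N(x_noisy | x, sigma^2)

# Linear score model s_theta(x) = w * x + c, fitted by least squares on
# (x_noisy, target) pairs: the denoising score matching regression.
A = np.stack([x_noisy, np.ones_like(x_noisy)], axis=1)
(w, c), *_ = np.linalg.lstsq(A, target, rcond=None)

# For Gaussian data the noised marginal is p_sigma = N(mu, s^2 + sigma^2),
# whose score is -(x - mu) / (s^2 + sigma^2).
w_true = -1.0 / (s**2 + sigma**2)
c_true = mu / (s**2 + sigma**2)

print("fitted:", w, c)
print("true  :", w_true, c_true)
```

The regression never sees the marginal score, only per-sample noise, yet the fitted slope and intercept should match the score of p_σ closely. Note that they match the smoothed distribution (variance s² + σ²), not the clean data (variance s²).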

04 · Using it

Langevin dynamics: follow the arrows, but jitter

Given a score, sampling is almost gradient ascent on log p:

x_{k+1} = x_k + (η/2) · s_θ(x_k) + √η · ε_k, ε_k ~ N(0, I)

(the √η · ε_k term is the injected noise, and it is not optional)

The noise term is the entire difference between sampling and optimising. Pure ascent collapses every chain onto the modes; the calibrated jitter makes the stationary distribution p itself in the limit of small step size η (at finite η the unadjusted chain carries a small discretisation bias), so chains spend time in proportion to probability. Think of a marble rolling into valleys of −log p while the table is shaken: the shaking is what makes it explore the valley floor rather than sit at the deepest point.
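The contrast between ascent and sampling can be seen on a toy target. The sketch below assumes a 1-D mixture 0.5·N(−1, 0.6²) + 0.5·N(+1, 0.6²) whose score is known exactly (standing in for a learned s_θ), a step size η = 0.01, and 5000 parallel chains; all of these are illustrative choices.

```python
import numpy as np

MEANS = np.array([-1.0, 1.0])
STD = 0.6

def score(x):
    # Exact score of 0.5*N(-1, 0.6^2) + 0.5*N(+1, 0.6^2); stands in for s_theta.
    log_r = -0.5 * ((x[:, None] - MEANS) / STD) ** 2   # equal weights, equal std
    log_r -= log_r.max(axis=1, keepdims=True)          # numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)                  # component responsibilities
    return np.sum(r * (MEANS - x[:, None]), axis=1) / STD**2

def run(x0, eta, n_steps, noise, rng):
    x = x0.copy()
    for _ in range(n_steps):
        x = x + 0.5 * eta * score(x)                   # drift: follow the arrows
        if noise:
            x = x + np.sqrt(eta) * rng.standard_normal(x.shape)  # jitter
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 3.0, size=5000)                   # chains start from wide noise

x_langevin = run(x0, eta=0.01, n_steps=2000, noise=True, rng=rng)
x_ascent = run(x0, eta=0.01, n_steps=2000, noise=False, rng=rng)

# The mixture has mean 0 and variance STD^2 + 1 = 1.36.
print("Langevin mean/var:", x_langevin.mean(), x_langevin.var())
n_distinct_ascent = len(np.unique(np.round(x_ascent, 2)))
print("distinct end points without noise:", n_distinct_ascent)
```

With noise, the sample mean and variance should sit close to those of the mixture. Without it, the same update rule is plain gradient ascent and the 5000 chains should end on just two points, the two modes.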

One failure remains: with a single small σ, Langevin chains mix poorly between distant modes, and far from the data the learned score is still poorly estimated. The fix is a whole ladder of noise levels: large σ to guide from far away, small σ to refine. That ladder is precisely the forward (noising) process of diffusion, and the ε-prediction network of DDPM is this denoising score model wearing different notation: up to scaling, ε_θ(x̃) ≈ −σ · s_θ(x̃).
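A small experiment illustrates the mixing failure and the ladder fix. The setup is assumed for illustration: a 1-D mixture with unequal weights, 0.8·N(−4, 0.2²) + 0.2·N(+4, 0.2²), the exact score of its noised versions in place of a trained noise-conditional network, a geometric ladder from σ = 8 down to σ = 0.1, and a step size proportional to σ² (a common heuristic, not something the text above prescribes).

```python
import numpy as np

WEIGHTS = np.array([0.8, 0.2])
MEANS = np.array([-4.0, 4.0])
DATA_STD = 0.2

def noised_score(x, sigma):
    # Exact score of p_sigma (the data mixture convolved with N(0, sigma^2)).
    # Stands in for a trained noise-conditional network s_theta(x, sigma).
    var = DATA_STD**2 + sigma**2
    log_r = np.log(WEIGHTS) - 0.5 * (x[:, None] - MEANS) ** 2 / var
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)                  # component responsibilities
    return np.sum(r * (MEANS - x[:, None]), axis=1) / var

def langevin(x, sigma, eta, n_steps, rng):
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * eta * noised_score(x, sigma) + np.sqrt(eta) * noise
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 8.0, size=5000)                   # start from wide noise
sigmas = np.geomspace(8.0, 0.1, 10)                    # ladder: large -> small
steps_per_level = 200

# Annealed Langevin: run a short chain at each noise level, largest first.
x_annealed = x0.copy()
for sigma in sigmas:
    x_annealed = langevin(x_annealed, sigma, 0.2 * sigma**2, steps_per_level, rng)

# Baseline: the same total number of steps, all at the smallest noise level.
total_steps = steps_per_level * len(sigmas)
x_single = langevin(x0.copy(), sigmas[-1], 0.2 * sigmas[-1] ** 2, total_steps, rng)

frac_left_annealed = np.mean(x_annealed < 0.0)
frac_left_single = np.mean(x_single < 0.0)
print("fraction in left mode, annealed:", frac_left_annealed)
print("fraction in left mode, single sigma:", frac_left_single)
```

The left mode carries 80% of the mass. The single-σ chains should simply fall into whichever mode is nearer their starting point, giving a split near 50/50, while the annealed chains mix at large σ, where the two modes overlap, and should end much closer to 80/20.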

Mental Model

The score is the compass of a distribution: direction uphill in log-probability, everywhere, with the unknowable altitude (Z) cancelled out.
Naive score learning fails off the data manifold — exactly where samplers start.
Tweedie's insight: the score of noised data is the expected added noise, negated and scaled by 1/σ, so "predict the noise" is score estimation in disguise.
Langevin sampling = follow the compass + mandatory jitter; the jitter converts hill-climbing into sampling.
One noise level gives a weak compass; a ladder of noise levels gives diffusion.