Generative Modelling

The Score Function

The direction uphill, without ever knowing the height

01 · First principles: The constant that kills density modelling

Write any density as an energy model: pθ(x) = exp(−Eθ(x)) / Zθ, where Zθ = ∫ exp(−Eθ(x)) dx normalises it. The energy is just a neural network: easy. Zθ is an integral over all of image space: hopeless. And maximum likelihood needs Zθ, because making the data probable means making everything else improbable, and you cannot budget probability without knowing the total.

Now take the gradient of the log-density with respect to x, not θ:

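$$\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0} = -\nabla_x E_\theta(x)$$

The middle term is zero because Zθ does not depend on x.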

The normalising constant vanishes. This vector field, s(x) = ∇x log p(x), is the score: at every point it points toward higher data density, with magnitude proportional to how steeply the log-density rises. It encodes the entire shape of the distribution while being completely blind to Z.
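
A quick numerical check of that blindness to Z, as a minimal sketch: the two-bump 1-D density, the `scale` argument standing in for an unknown normalising constant, and the finite-difference step `h` are all illustrative assumptions, not part of the original text.

```python
import math

def unnormalised_density(x, scale=1.0):
    """Two-bump density exp(-E(x)) times an arbitrary constant (hypothetical toy)."""
    return scale * (math.exp(-0.5 * (x + 2.0) ** 2) + math.exp(-0.5 * (x - 2.0) ** 2))

def score(x, scale=1.0, h=1e-5):
    """Central finite-difference estimate of d/dx log(density)."""
    def log_p(t):
        return math.log(unnormalised_density(t, scale))
    return (log_p(x + h) - log_p(x - h)) / (2.0 * h)

points = [-3.0, -1.0, 0.5, 2.5]
scores_a = [score(x, scale=1.0) for x in points]
scores_b = [score(x, scale=1234.5) for x in points]  # same shape, different "Z"

for x, a, b in zip(points, scores_a, scores_b):
    print(f"x = {x:+.1f}   score = {a:+.4f}   rescaled = {b:+.4f}")
```

Multiplying the density by any constant shifts log p by a constant, so the two columns should agree up to finite-difference error, and the sign of each score should point toward the nearer bump.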

The pivot: if knowing "where is the probability" (Z) is intractable but "which way is more probable" (the score) is free of Z, then build the whole generative model out of directions.

02 · Visualize it: A field of arrows

[Figure: score field around two density modes, labelled Mode A and Mode B.]

The score field ∇x log p(x): every arrow points toward the nearest region of high density. Grey contours are level sets of p; the field knows their shape but not their absolute height.

03 · How it breaks: Learning the score, and where naive matching fails

Train a network sθ(x) to equal the true score by minimising Ep[‖sθ(x) − ∇x log p(x)‖²]. The target contains the unknown ∇ log p — circular. Score matching (Hyvärinen) fixes the circularity with integration by parts, leaving an objective in sθ alone, but it requires the trace of the Jacobian ∇xsθ, which is expensive in high dimension.
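
For reference, the integration-by-parts identity can be written out as follows. This is a sketch under the usual regularity assumption that p(x)·sθ(x) vanishes as ‖x‖ → ∞; the factor ½ is a convention and does not change the minimiser.

```latex
\mathbb{E}_{p}\!\left[\tfrac{1}{2}\,\bigl\| s_\theta(x) - \nabla_x \log p(x) \bigr\|^2\right]
= \mathbb{E}_{p}\!\left[\operatorname{tr}\!\bigl(\nabla_x s_\theta(x)\bigr)
  + \tfrac{1}{2}\,\bigl\| s_\theta(x) \bigr\|^2\right] + \text{const}
```

The constant does not depend on θ. The right-hand side needs only samples from p, but the trace term involves every diagonal entry of the Jacobian, which is where the cost in high dimension comes from.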

There is a worse problem. Real data sits near a thin manifold: almost everywhere in pixel space, p(x) ≈ 0, the training expectation puts no weight there, and the learned score is garbage exactly where a sampler starting from random noise begins its journey. A perfect score on the manifold and nonsense everywhere else is a map of the destination with no roads to it.

Denoising score matching solves both at once. Blur the data with Gaussian noise, x̃ = x + σε with ε ~ N(0, I), and learn the score of the noised distribution pσ. A short calculation (the denoising score matching identity of Vincent, a close relative of Tweedie's formula) shows the conditional score has a closed form:

$$\nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2} = -\frac{\varepsilon}{\sigma}$$

The score is the (scaled, negated) noise you added,

and that regressing on this conditional target gives the true marginal score in expectation:

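$$\min_\theta \; \mathbb{E}_{x,\varepsilon}\!\left[\left\| s_\theta(x + \sigma\varepsilon) + \frac{\varepsilon}{\sigma} \right\|^2\right] \;\Rightarrow\; s_\theta \approx \nabla \log p_\sigma$$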
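
The step from conditional target to marginal score can be sketched in three lines, assuming differentiation and integration may be exchanged:

```latex
\begin{aligned}
\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})
&= \frac{\nabla_{\tilde{x}} \int p(x)\, p_\sigma(\tilde{x} \mid x)\, dx}{p_\sigma(\tilde{x})} \\
&= \int \frac{p(x)\, p_\sigma(\tilde{x} \mid x)}{p_\sigma(\tilde{x})}\,
   \nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x)\, dx \\
&= \mathbb{E}\!\left[\, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x) \;\middle|\; \tilde{x} \,\right]
 = \mathbb{E}\!\left[ -\frac{\varepsilon}{\sigma} \;\middle|\; \tilde{x} \right]
\end{aligned}
```

A squared-error regression converges to the conditional mean of its target, so a network trained to predict −ε/σ from x̃ is pulled toward exactly this expectation, which is the marginal score.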

No Jacobians, no Z, and the noise itself spreads training data over the off-manifold regions that need coverage. The cost: we learn the score of a smoothed distribution, not of p itself. Diffusion will turn that cost into the design.
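
A minimal sketch of the objective on a toy problem where the answer is known. Everything here is an assumption chosen for checkability: the data are 1-D standard normal, the "network" is a single slope `a` in s(x̃) = a·x̃, and the fit is done by closed-form least squares rather than gradient descent. For this setup pσ = N(0, 1 + σ²), so the true score is −x̃ / (1 + σ²).

```python
import random

random.seed(0)
sigma = 0.5
n = 50_000

# Minimising mean((a * x_tilde + eps / sigma)^2) over a gives
# a = -sum(x_tilde * eps / sigma) / sum(x_tilde^2).
num = 0.0
den = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)      # clean sample (toy assumption: p = N(0, 1))
    eps = random.gauss(0.0, 1.0)    # the noise we add
    x_tilde = x + sigma * eps       # noised sample
    num += x_tilde * (eps / sigma)
    den += x_tilde * x_tilde

a_fit = -num / den
a_true = -1.0 / (1.0 + sigma ** 2)  # slope of the score of N(0, 1 + sigma^2)

print(f"fitted slope {a_fit:.3f}, true slope of the noised score {a_true:.3f}")
```

The regression never sees log p or Z, only pairs of noised points and the noise that produced them, yet the fitted slope should land close to the slope of the smoothed score rather than the slope −1 of the clean one.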

04 · Using it: Langevin dynamics: follow the arrows, but jitter

Given a score, sampling is almost gradient ascent on log p:

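$$x_{k+1} = x_k + \frac{\eta}{2}\, s_\theta(x_k) + \sqrt{\eta}\,\varepsilon_k, \qquad \varepsilon_k \sim \mathcal{N}(0, I)$$

The last term is the injected noise, and it is not optional.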

The noise term is the entire difference between sampling and optimising. Pure ascent collapses every chain onto the modes; the calibrated jitter makes the stationary distribution p itself (exactly in the limit η → 0, approximately for a small finite step), so chains spend time in proportion to probability. Think of a marble rolling into valleys of −log p while the table is shaken: the shaking is what makes it explore the valley floor rather than sit at the deepest point.
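
The update can be sketched in a few lines. As an assumption for illustration, the exact score of a 1-D Gaussian N(2, 1) stands in for a learned sθ; the step size, chain length, and starting point are arbitrary choices.

```python
import math
import random

def score(x, mu=2.0, std=1.0):
    """Exact score of N(mu, std^2), standing in for a learned s_theta."""
    return -(x - mu) / std ** 2

def run_chain(x0, eta, steps, rng, noise=True):
    """Iterate x <- x + (eta/2) * score(x) + sqrt(eta) * eps; drop eps if noise=False."""
    x = x0
    for _ in range(steps):
        x = x + 0.5 * eta * score(x)
        if noise:
            x += math.sqrt(eta) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [run_chain(-5.0, eta=0.05, steps=400, rng=rng) for _ in range(1000)]
ascent = [run_chain(-5.0, eta=0.05, steps=400, rng=rng, noise=False) for _ in range(5)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"with noise:    mean {mean:.2f}, variance {var:.2f}")
print(f"without noise: endpoints {[round(a, 3) for a in ascent]}")
```

With the noise on, the endpoints of independent chains should have roughly the mean and variance of the target; with it off, every chain ends at the mode and the spread is gone. A finite step leaves a small bias in the variance, which shrinks as η does.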

One failure remains: with a single small σ, Langevin chains mix poorly between distant modes and the far-field score is still weak. The fix — use a whole ladder of noise levels, large σ to guide from far away, small σ to refine — is precisely the forward process of diffusion, and the ε-prediction network of DDPM is this denoising score model wearing different notation.
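
A sketch of the ladder idea on a toy problem. The assumptions: the "dataset" is four 1-D points forming two distant modes, the noised score is computed exactly from those points instead of by a trained network, and the step-size rule η = c·σ² with c = 0.2 is one common heuristic rather than a prescribed choice.

```python
import math
import random

data = [-4.0, 4.0, 4.0, 4.0]  # toy dataset: two modes with weights 1/4 and 3/4

def noised_score(x, sigma):
    """Exact score of p_sigma = (1/n) * sum_i N(d_i, sigma^2)."""
    logits = [-(x - d) ** 2 / (2.0 * sigma ** 2) for d in data]
    m = max(logits)                                   # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * (d - x) for w, d in zip(weights, data)) / (total * sigma ** 2)

def annealed_langevin(rng, sigmas, steps_per_level=30, c=0.2):
    """Run Langevin at each noise level in turn, from large sigma to small."""
    x = rng.gauss(0.0, sigmas[0])                     # start from broad noise
    for sigma in sigmas:
        eta = c * sigma ** 2                          # step shrinks with the noise level
        for _ in range(steps_per_level):
            x += 0.5 * eta * noised_score(x, sigma) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    return x

levels = 10
sigmas = [8.0 * (0.1 / 8.0) ** (i / (levels - 1)) for i in range(levels)]  # 8.0 down to 0.1
rng = random.Random(0)
samples = [annealed_langevin(rng, sigmas) for _ in range(300)]
frac_right = sum(s > 0 for s in samples) / len(samples)
print(f"fraction of samples in the heavier mode: {frac_right:.2f}")
```

At the largest σ the two modes blur into one broad hill, so chains can cross freely; as σ shrinks, each chain is committed to a mode and then refined. Both modes should end up populated, with more chains in the heavier one.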

Mental Model