Generative Modelling

Diffusion as SDEs

Take the step size to zero and the whole zoo becomes one equation

01 · First principles: What is the noising chain, in the limit?

The forward process of DDPM is a discrete recipe: at each of a thousand steps, shrink the image slightly and add a small Gaussian kick. A thousand is an arbitrary number. So ask the physicist's question: what happens if we make the steps smaller and more numerous, without bound?

Each discrete step has the shape x_{t+Δt} = x_t + (deterministic nudge)·Δt + (random kick)·√Δt. The √Δt on the kick is the signature of Brownian motion: independent jitters accumulate like a random walk, so their standard deviation grows with the square root of time. Send Δt → 0 and the recipe becomes a stochastic differential equation:

dx = f(x, t) dt + g(t) dW
f(x, t) is the drift: where the flow carries you
g(t) is the diffusion: how hard the noise shakes

Here dW is an infinitesimal Brownian increment. The discrete chain was never the object; it was a numerical approximation of this continuous process, written down before anyone noticed.
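The √Δt scaling can be checked numerically. The sketch below is a toy illustration, not part of any diffusion library: it uses pure noise with no drift, and the function name `endpoint_variance` and its parameters are hypothetical. It sums Gaussian kicks of size Δt^exponent and measures the variance of the endpoint at a fixed total time.

```python
import math
import random

def endpoint_variance(n_steps, kick_exponent=0.5, total_time=1.0,
                      n_paths=4000, seed=0):
    """Variance at time `total_time` of a driftless walk whose kicks
    have standard deviation dt ** kick_exponent (hypothetical helper)."""
    rng = random.Random(seed)
    dt = total_time / n_steps
    kick_std = dt ** kick_exponent
    endpoints = []
    for _ in range(n_paths):
        x = 0.0
        for _ in range(n_steps):
            x += kick_std * rng.gauss(0.0, 1.0)
        endpoints.append(x)
    mean = sum(endpoints) / n_paths
    return sum((x - mean) ** 2 for x in endpoints) / n_paths

if __name__ == "__main__":
    # sqrt(dt) kicks: the endpoint variance stays near total_time
    # however finely the interval is cut.
    print(endpoint_variance(10), endpoint_variance(100))
    # dt-sized kicks: the noise washes out as the steps shrink.
    print(endpoint_variance(100, kick_exponent=1.0))
```

With √Δt kicks the answer does not depend on the number of steps, which is what makes the Δt → 0 limit well defined. With kicks proportional to Δt the variance is total_time²/n_steps, so the randomness disappears in the limit.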

The reframe: diffusion models do not have a thousand steps. They have a continuous-time process, and "a thousand steps" is one crude way to integrate it.

02 · The hinge: Anderson's reverse-time SDE

Forward noising is easy to run but useless for generation; we need the film played backwards. Naively, reversing a stochastic process sounds ill-posed — noise destroys information. The surprise (Anderson, 1982) is that the reversal is itself a clean SDE, and it demands exactly one unknown quantity:

dx = [ f(x, t) − g(t)² ∇_x log p_t(x) ] dt + g(t) dW̄
∇_x log p_t(x): the score of the noised marginal at time t

Time runs backwards (dt is negative, dW̄ is reverse-time Brownian motion), and the only term we cannot write down is the score ∇_x log p_t(x). This is the hinge of the whole subject: the impossible-sounding task "reverse the destruction of information" reduces precisely to the learnable task "estimate the score at every noise level". Denoising score matching already solves that task, and the ε-network of DDPM already computes it in disguise (ε̂ = −σ_t · score, where σ_t is the standard deviation of the noise in x_t).
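To see the reverse-time SDE at work without training anything, take a case where the score is known in closed form. The sketch below rests on several assumptions that are not from the text: 1-D Gaussian data with standard deviation `DATA_STD`, drift f = −½βx, diffusion g = √β with constant β, and hypothetical helper names throughout. Under those assumptions the noised marginal is Gaussian with variance `marginal_var(t)`, so the score is −x / variance, and Euler–Maruyama steps taken backwards in time should recover the data variance.

```python
import math
import random

BETA = 1.0       # constant noise rate (illustrative choice)
DATA_STD = 0.5   # std of the toy Gaussian "data" (hypothetical)
T_END = 5.0      # time by which the data is almost fully noised

def marginal_var(t):
    # Variance of the noised marginal p_t for this toy setup.
    return 1.0 + (DATA_STD ** 2 - 1.0) * math.exp(-BETA * t)

def score(x, t):
    # Exact score of a zero-mean Gaussian; a trained network would
    # stand in for this function in a real model.
    return -x / marginal_var(t)

def reverse_sde_sample(rng, n_steps=500):
    h = T_END / n_steps  # positive step size; t walks downward
    t = T_END
    x = rng.gauss(0.0, math.sqrt(marginal_var(T_END)))
    for _ in range(n_steps):
        drift = -0.5 * BETA * x - BETA * score(x, t)  # f - g^2 * score
        x = x - drift * h + math.sqrt(BETA * h) * rng.gauss(0.0, 1.0)
        t -= h
    return x

def sample_variance(n_paths=2000, seed=0):
    rng = random.Random(seed)
    xs = [reverse_sde_sample(rng) for _ in range(n_paths)]
    mean = sum(xs) / n_paths
    return sum((x - mean) ** 2 for x in xs) / n_paths

if __name__ == "__main__":
    print(sample_variance(), "target:", DATA_STD ** 2)
```

The samples start at variance close to 1 and end close to 0.25. Nothing but the score steers them there.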

03 · The twin: The probability-flow ODE

There is a second, stranger reversal. For every diffusion SDE there exists a deterministic ODE whose solutions, started from the same initial distribution, have the same marginal distribution p_t at every time:

dx = [ f(x, t) − ½ g(t)² ∇_x log p_t(x) ] dt
no dW — same marginals, zero randomness

Individual trajectories differ completely — the SDE jitters, the ODE glides — but the cloud of particles evolves identically. The factor of ½ is the bookkeeping: half of the score term in the reverse SDE was there to cancel the incoming noise, and with no noise injected, only half is needed.

[Figure: sample paths from data (t = 0) to noise (t = T). Panels: SDE, jittery twins; probability-flow ODE, one smooth path.]

Same marginals at every vertical slice. The SDE wanders; the ODE is its deterministic twin, gliding through the same evolving distribution.
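The deterministic twin can be run on a toy problem with a known score. The sketch below assumes, for illustration only, 1-D Gaussian data, f = −½βx and g = √β with constant β; all names are hypothetical. For this case the exact flow simply rescales x by √(variance ratio), which gives something to check the integrator against.

```python
import math

BETA = 1.0       # constant noise rate (illustrative choice)
DATA_STD = 0.5   # std of the toy Gaussian "data" (hypothetical)
T_END = 5.0

def marginal_var(t):
    return 1.0 + (DATA_STD ** 2 - 1.0) * math.exp(-BETA * t)

def score(x, t):
    return -x / marginal_var(t)  # exact for a zero-mean Gaussian

def pf_ode_velocity(x, t):
    # f(x, t) - 1/2 * g(t)^2 * score: no noise term anywhere.
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t)

def integrate_ode(x, t_start, t_end, n_steps=2000):
    # Plain Euler steps; dt is negative when integrating noise -> data.
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x += pf_ode_velocity(x, t) * dt
        t += dt
    return x

def exact_map(x_T):
    # Closed-form solution for this toy: a pure rescaling.
    return x_T * math.sqrt(marginal_var(0.0) / marginal_var(T_END))

if __name__ == "__main__":
    x0 = integrate_ode(1.0, T_END, 0.0)   # noise -> data
    print(x0, exact_map(1.0))
    print(integrate_ode(x0, 0.0, T_END))  # data -> noise, back near 1.0
```

Running the same noise value twice gives the same data value, and integrating forward again returns (up to discretisation error) to the starting noise. That round trip is the invertibility the ODE view offers.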

The deterministic twin buys real things: every data point gets a unique, invertible latent code; exact likelihoods become computable via the instantaneous change of variables (a Jacobian trace, not a determinant); and interpolation in latent space becomes meaningful.
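The likelihood claim can be made concrete in one dimension, where the Jacobian trace is just the derivative of the velocity with respect to x. The sketch below is a toy under stated assumptions (Gaussian data, f = −½βx, g = √β, constant β, hypothetical names). It integrates the ODE from data to noise, accumulates the divergence along the path, and uses log p_0(x_0) = log p_T(x_T) + ∫ div v dt. In many dimensions the trace is usually estimated stochastically rather than computed exactly; that part is not shown.

```python
import math

BETA = 1.0       # constant noise rate (illustrative choice)
DATA_STD = 0.5   # std of the toy Gaussian "data" (hypothetical)
T_END = 5.0

def marginal_var(t):
    return 1.0 + (DATA_STD ** 2 - 1.0) * math.exp(-BETA * t)

def velocity(x, t):
    score = -x / marginal_var(t)
    return -0.5 * BETA * x - 0.5 * BETA * score

def divergence(x, t):
    # d(velocity)/dx; in 1-D this is the whole Jacobian trace.
    return -0.5 * BETA + 0.5 * BETA / marginal_var(t)

def log_likelihood(x0, n_steps=5000):
    dt = T_END / n_steps
    x, t, div_integral = x0, 0.0, 0.0
    for _ in range(n_steps):
        div_integral += divergence(x, t) * dt
        x += velocity(x, t) * dt
        t += dt
    var_T = marginal_var(T_END)
    log_p_T = -0.5 * math.log(2.0 * math.pi * var_T) - x * x / (2.0 * var_T)
    return log_p_T + div_integral

def exact_log_density(x0):
    # Ground truth for comparison: the data is N(0, DATA_STD^2).
    var0 = DATA_STD ** 2
    return -0.5 * math.log(2.0 * math.pi * var0) - x0 * x0 / (2.0 * var0)

if __name__ == "__main__":
    print(log_likelihood(0.3), exact_log_density(0.3))
```

The two numbers agree up to discretisation error, with no determinant computed anywhere.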

04 · Two lineages: VP and VE are choices of f and g

The two historical schools of diffusion are nothing but two parameter choices in dx = f dt + g dW.

VP — Variance Preserving (DDPM lineage)
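f(x,t) = −½β(t)x, g(t) = √β(t). The drift shrinks x while noise is added, so for unit-variance data the total variance stays ≈ 1. The discrete chain x_t = √(1−β_t) x_{t−1} + √β_t ε, with β_t = β(t)Δt, agrees with its Euler–Maruyama step to first order, since √(1−β_t) ≈ 1 − ½β_t.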
VE — Variance Exploding (score-matching lineage)
f(x,t) = 0, g(t) = √(dσ²(t)/dt). No shrinking; noise piles on until σ dwarfs the data. This is the noise ladder of NCSN, the multi-scale fix from score matching.

Two communities, one equation. The unification is not aesthetic tidiness; it means every result proven for one lineage (training objectives, reverse processes, likelihood bounds) transfers to the other by substituting f and g.
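The names "preserving" and "exploding" can be read off a single variance equation. For a linear drift f(x, t) = a(t)·x, the variance of x obeys dVar/dt = 2·a(t)·Var + g(t)². The sketch below integrates that equation for both lineages. The schedules are illustrative assumptions, not values from the text: a linear β(t) from 0.1 to 20 for VP, and σ(t) = 10·t for VE. The function names are hypothetical.

```python
def vp_coeffs(t, beta_min=0.1, beta_max=20.0):
    # VP: f = -1/2 * beta(t) * x, g^2 = beta(t). Returns (a(t), g(t)^2).
    beta = beta_min + t * (beta_max - beta_min)
    return -0.5 * beta, beta

def ve_coeffs(t, sigma_max=10.0):
    # VE: f = 0, g^2 = d(sigma^2)/dt with sigma(t) = sigma_max * t.
    return 0.0, 2.0 * sigma_max ** 2 * t

def variance_at_end(coeffs, var0=1.0, n_steps=1000):
    # Euler integration of dVar/dt = 2 a(t) Var + g(t)^2 on t in [0, 1].
    dt = 1.0 / n_steps
    var = var0
    for k in range(n_steps):
        a, g2 = coeffs(k * dt)
        var += (2.0 * a * var + g2) * dt
    return var

if __name__ == "__main__":
    print("VP:", variance_at_end(vp_coeffs))  # stays at 1
    print("VE:", variance_at_end(ve_coeffs))  # grows to about 1 + sigma_max^2
```

Same integrator, same equation, two pairs of coefficients: swapping `vp_coeffs` for `ve_coeffs` is the entire difference between the lineages in this sketch.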

05 · The payoff: Samplers become numerical integrators

Here is what the continuous view purchases. Once generation is "solve this SDE/ODE backwards", the sampler is no longer part of the model — it is a numerical integration scheme, and sixty years of numerical analysis apply off the shelf.

  1. Ancestral DDPM sampling is revealed as, to first order in the step size, Euler–Maruyama on the reverse SDE: the crudest first-order integrator, hence the thousand steps.
  2. Higher-order solvers (Heun, DPM-Solver, exponential integrators that handle the linear drift exactly) take bigger steps with the same accuracy: 10–50 network calls instead of 1000, with no retraining.
  3. The SDE/ODE dial becomes a design choice: ODE sampling is fast and reproducible; injecting some stochasticity acts as a corrector that forgives accumulated score error.

The shape of the payoff: training and sampling decouple. Train one score network once; choose the integrator afterwards, per deployment, by speed–quality budget. Flow matching pushes this one step further by making the paths straight so that even crude integrators need few steps.
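The integrator swap can be tried on a toy where the exact answer is known. The sketch below assumes 1-D Gaussian data, f = −½βx, g = √β with constant β, and an analytic score standing in for a trained network; all names are hypothetical. It solves the probability-flow ODE from noise to data with Euler and with Heun's second-order method, counting score evaluations as a proxy for network calls.

```python
import math

BETA = 1.0       # constant noise rate (illustrative choice)
DATA_STD = 0.5   # std of the toy Gaussian "data" (hypothetical)
T_END = 5.0

def marginal_var(t):
    return 1.0 + (DATA_STD ** 2 - 1.0) * math.exp(-BETA * t)

def velocity(x, t, calls):
    calls[0] += 1  # one score evaluation, i.e. one "network call"
    score = -x / marginal_var(t)
    return -0.5 * BETA * x - 0.5 * BETA * score

def euler_sample(x, n_steps):
    calls, dt, t = [0], -T_END / n_steps, T_END
    for _ in range(n_steps):
        x += velocity(x, t, calls) * dt
        t += dt
    return x, calls[0]

def heun_sample(x, n_steps):
    calls, dt, t = [0], -T_END / n_steps, T_END
    for _ in range(n_steps):
        k1 = velocity(x, t, calls)
        x_pred = x + k1 * dt                  # Euler predictor
        k2 = velocity(x_pred, t + dt, calls)  # slope at the predicted point
        x += 0.5 * (k1 + k2) * dt             # average the two slopes
        t += dt
    return x, calls[0]

def exact_sample(x):
    return x * math.sqrt(marginal_var(0.0) / marginal_var(T_END))

if __name__ == "__main__":
    target = exact_sample(1.0)
    x_e, n_e = euler_sample(1.0, 100)
    x_h, n_h = heun_sample(1.0, 50)
    print("Euler:", n_e, "calls, error", abs(x_e - target))
    print("Heun: ", n_h, "calls, error", abs(x_h - target))
```

Both runs spend the same number of score evaluations, and the score function is untouched between them. Only the integration scheme changes, which is the decoupling of training from sampling in miniature.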
Mental Model