Take the step size to zero and the whole zoo becomes one equation
The forward process of DDPM is a discrete recipe: at each of a thousand steps, shrink the image slightly and add a small Gaussian kick. A thousand is an arbitrary number. So ask the physicist's question: what happens if we make the steps smaller and more numerous, without bound?
Each discrete step has the shape x(t + Δt) = x(t) + (deterministic nudge)·Δt + (random kick)·√Δt. The √Δt on the kick is the signature of Brownian motion: independent jitters accumulate like a random walk, so their standard deviation grows with the square root of time. Send Δt → 0 and the recipe becomes a stochastic differential equation:
dx = f(x, t) dt + g(t) dW
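Here f is the drift (the deterministic nudge), g is the diffusion coefficient (the size of the random kick), and dW is an infinitesimal Brownian increment. The discrete chain was never the object; it was a numerical approximation of this continuous process, written down before anyone noticed.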
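To make the limit concrete, here is a minimal sketch of the discrete recipe as an Euler-Maruyama loop. The specific coefficients (constant `beta`, drift −½·beta·x, kick size √beta) are an illustrative assumption, not something fixed by the text above, and the function name is hypothetical. For this choice the exact mean and variance at time T are known in closed form, so the effect of shrinking the step can be read off directly.

```python
import numpy as np

def forward_euler_maruyama(x0, beta=1.0, T=5.0, n_steps=1000, seed=0):
    """Simulate dx = -0.5*beta*x dt + sqrt(beta) dW with Euler-Maruyama steps.

    The coefficients are an assumed toy choice, used only for illustration.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        nudge = -0.5 * beta * x                              # deterministic part
        kick = np.sqrt(beta) * rng.standard_normal(x.shape)  # random part
        x = x + nudge * dt + kick * np.sqrt(dt)              # dt vs sqrt(dt)
    return x

x0 = np.full(20000, 2.0)  # 20,000 particles, all starting at x = 2

# Closed-form mean and variance at t = T for beta = 1, T = 5.
exact_mean = 2.0 * np.exp(-0.5 * 1.0 * 5.0)
exact_var = 1.0 - np.exp(-1.0 * 5.0)

for n in (10, 100, 1000):
    xT = forward_euler_maruyama(x0, n_steps=n)
    print(f"{n:5d} steps: mean={xT.mean():.3f} var={xT.var():.3f}")
print(f"exact      : mean={exact_mean:.3f} var={exact_var:.3f}")
```

With coarse steps the statistics are visibly off; as the step count grows they settle toward the closed-form values, which is the sense in which the chain approximates the SDE.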
Forward noising is easy to run but useless for generation; we need the film played backwards. Naively, reversing a stochastic process sounds ill-posed, because noise destroys information. The surprise (Anderson, 1982) is that the reversal is itself a clean SDE, and it demands exactly one unknown quantity:
dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dW̄
Time runs backwards (dt is negative, dW̄ is reverse Brownian motion), and the only term we cannot write down is the score ∇_x log p_t(x). This is the hinge of the whole subject: the impossible-sounding task "reverse the destruction of information" reduces precisely to the learnable task "estimate the score at every noise level". Denoising score matching already solves that task, and the ε-network of DDPM already computes it in disguise (ε̂ = −σ_t · score).
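A small sketch shows the reversal at work in a case where the score is known exactly. It assumes one-dimensional Gaussian data N(M0, S0²) and the toy coefficients f = −½·BETA·x, g = √BETA. All names and constants here are hypothetical. Under these assumptions every marginal p_t is Gaussian, so `score` can be written analytically; in a real model a trained network would stand in for it.

```python
import numpy as np

BETA, T = 1.0, 10.0   # assumed constant noise rate and time horizon
M0, S0 = 2.0, 0.5     # toy data distribution N(M0, S0^2)

def marginal(t):
    """Mean and variance of p_t for the toy Gaussian data."""
    decay = np.exp(-BETA * t)
    return M0 * np.sqrt(decay), S0**2 * decay + (1.0 - decay)

def score(x, t):
    """Exact score of the Gaussian marginal; a network would replace this."""
    m, v = marginal(t)
    return -(x - m) / v

def reverse_sde_sample(n_samples=20000, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n_samples)        # prior N(0, 1) at t = T
    for i in range(n_steps):
        t = T - i * dt
        f = -0.5 * BETA * x
        g2 = BETA
        drift = f - g2 * score(x, t)          # reverse-SDE drift
        noise = np.sqrt(g2 * dt) * rng.standard_normal(n_samples)
        x = x - drift * dt + noise            # one step backwards in time
    return x

samples = reverse_sde_sample()
print(f"mean={samples.mean():.3f} std={samples.std():.3f}")
```

Starting from pure noise, the samples should land close to the data mean and standard deviation, up to discretization and sampling error. Nothing but the score was needed to get there.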
There is a second, stranger reversal. For every diffusion SDE there exists a deterministic ODE whose solution has the same marginal distribution p_t at every time:
dx/dt = f(x, t) − ½ g(t)² ∇_x log p_t(x)
Individual trajectories differ completely — the SDE jitters, the ODE glides — but the cloud of particles evolves identically. The factor of ½ is the bookkeeping: half of the score term in the reverse SDE was there to cancel the incoming noise, and with no noise injected, only half is needed.
Figure: Same marginals at every vertical slice. The SDE wanders; the ODE is its deterministic twin, gliding through the same evolving distribution.
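The deterministic twin can be sketched in the same toy setting: Gaussian data N(M0, S0²), coefficients f = −½·BETA·x and g = √BETA, and an analytic score. All of these are assumptions made for illustration, and the names are hypothetical. The only changes from a reverse-SDE sampler are the factor of ½ on the score term and the absence of injected noise.

```python
import numpy as np

BETA, T = 1.0, 10.0   # assumed constant noise rate and time horizon
M0, S0 = 2.0, 0.5     # toy data distribution N(M0, S0^2)

def marginal(t):
    """Mean and variance of p_t for the toy Gaussian data."""
    decay = np.exp(-BETA * t)
    return M0 * np.sqrt(decay), S0**2 * decay + (1.0 - decay)

def score(x, t):
    m, v = marginal(t)
    return -(x - m) / v

def probability_flow(x_T, n_steps=1000):
    """Euler-integrate dx/dt = f - 0.5*g^2*score from t = T down to t = 0."""
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        drift = -0.5 * BETA * x - 0.5 * BETA * score(x, t)
        x = x - drift * dt                    # no noise term at all
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(20000)              # latent codes from the prior
x_0 = probability_flow(x_T)
print(f"mean={x_0.mean():.3f} std={x_0.std():.3f}")

# Determinism: the same latent always maps to the same sample.
print(np.array_equal(x_0, probability_flow(x_T)))
```

The cloud of endpoints should match the data statistics, as it would for the stochastic sampler, while each individual latent is carried to a single fixed output.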
The deterministic twin buys real things: every data point gets a unique, invertible latent code; exact likelihoods become computable via the instantaneous change of variables (a Jacobian trace, not a determinant); and interpolation in latent space becomes meaningful.
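The likelihood claim can be checked in a toy case. Along an ODE trajectory, d/dt log p_t(x(t)) equals minus the divergence of the ODE drift. Integrating from the data at t = 0 to the prior at t = T therefore gives log p_0(x_0) = log p_T(x_T) + ∫ divergence dt. The sketch below assumes one-dimensional Gaussian data and the coefficients f = −½·BETA·x, g = √BETA, with hypothetical names throughout. In one dimension the Jacobian trace is an ordinary derivative, written out by hand here; in high dimensions it would typically come from automatic differentiation or a stochastic trace estimator.

```python
import numpy as np

BETA, T = 1.0, 10.0   # assumed constant noise rate and time horizon
M0, S0 = 2.0, 0.5     # toy data distribution N(M0, S0^2)

def marginal(t):
    """Mean and variance of p_t for the toy Gaussian data."""
    decay = np.exp(-BETA * t)
    return M0 * np.sqrt(decay), S0**2 * decay + (1.0 - decay)

def ode_drift(x, t):
    """f - 0.5*g^2*score, with the exact Gaussian score -(x - m)/v."""
    m, v = marginal(t)
    return -0.5 * BETA * x + 0.5 * BETA * (x - m) / v

def ode_divergence(t):
    """d(ode_drift)/dx; in 1-D this is the whole Jacobian trace."""
    _, v = marginal(t)
    return -0.5 * BETA + 0.5 * BETA / v

def log_likelihood(x0, n_steps=2000):
    dt = T / n_steps
    x, div_integral = float(x0), 0.0
    for i in range(n_steps):
        t = i * dt
        div_integral += ode_divergence(t) * dt   # accumulate the trace
        x += ode_drift(x, t) * dt                # move data towards the prior
    log_prior = -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)  # N(0, 1) at t = T
    return log_prior + div_integral

x0 = 2.3
estimate = log_likelihood(x0)
exact = -0.5 * ((x0 - M0) / S0) ** 2 - np.log(S0) - 0.5 * np.log(2.0 * np.pi)
print(f"ODE estimate={estimate:.4f} exact={exact:.4f}")
```

The two numbers should agree closely. The small gap comes from the Euler discretization and from treating p_T as exactly N(0, 1).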
The two historical schools of diffusion are nothing but two parameter choices in dx = f dt + g dW.
Two communities, one equation. The unification is not aesthetic tidiness; it means every result proven for one lineage (training objectives, reverse processes, likelihood bounds) transfers to the other by substituting f and g.
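As a sketch of what "substituting f and g" looks like in practice, suppose the two schools are the DDPM lineage and the noise-conditional score-matching lineage. That reading is an assumption here, as is the standard naming of their continuous limits as variance-preserving (VP) and variance-exploding (VE). The schedules and constants below are illustrative, and all function names are hypothetical. A single simulator runs both once f and g are supplied.

```python
import numpy as np

# VP choice (assumed DDPM lineage): f = -0.5*beta(t)*x, g = sqrt(beta(t)).
def beta(t, beta_min=0.1, beta_max=20.0):
    return beta_min + t * (beta_max - beta_min)

def vp_f(x, t):
    return -0.5 * beta(t) * x

def vp_g(t):
    return np.sqrt(beta(t))

# VE choice (assumed score-matching lineage): f = 0, g = sqrt(d sigma^2/dt).
SIGMA_MIN, SIGMA_MAX = 0.1, 5.0

def sigma(t):
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def ve_f(x, t):
    return np.zeros_like(x)

def ve_g(t):
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))

def simulate(f, g, x0, n_steps=1000, seed=0):
    """Euler-Maruyama for dx = f dt + g dW on t in [0, 1]."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

x0 = np.random.default_rng(1).standard_normal(20000)  # unit-variance toy data
x_vp = simulate(vp_f, vp_g, x0)
x_ve = simulate(ve_f, ve_g, x0)
print(f"VP variance={x_vp.var():.2f}  VE variance={x_ve.var():.2f}")
```

Under the first choice the variance of unit-variance data stays near one. Under the second it grows by roughly SIGMA_MAX² − SIGMA_MIN². The simulator itself never changes.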
Here is what the continuous view purchases. Once generation is "solve this SDE/ODE backwards", the sampler is no longer part of the model — it is a numerical integration scheme, and sixty years of numerical analysis apply off the shelf.
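A minimal sketch of that separation: the drift function plays the role of the model, and the stepping rule is swapped without touching it. The setting is assumed for illustration (zero-mean Gaussian data with standard deviation S0, coefficients f = −½·BETA·x and g = √BETA, analytic score) and all names are hypothetical. For this toy problem the probability-flow ODE has a closed-form solution, so the error of each integrator can be measured directly.

```python
import numpy as np

BETA, T, S0 = 1.0, 10.0, 0.5   # toy data N(0, S0^2); constants are assumed

def variance(t):
    """Variance of p_t for the toy Gaussian data."""
    return 1.0 - (1.0 - S0**2) * np.exp(-BETA * t)

def drift(x, t):
    """The 'model': f - 0.5*g^2*score, with score = -x / variance(t)."""
    return -0.5 * BETA * x + 0.5 * BETA * x / variance(t)

def euler_step(x, t, dt):
    """First-order step backwards in time."""
    return x - dt * drift(x, t)

def heun_step(x, t, dt):
    """Second-order step: Euler predictor, trapezoidal corrector."""
    k1 = drift(x, t)
    x_pred = x - dt * k1
    k2 = drift(x_pred, t - dt)
    return x - 0.5 * dt * (k1 + k2)

def sample(x_T, step, n_steps):
    """Integrate from t = T down to t = 0 with the given stepping rule."""
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        x = step(x, T - i * dt, dt)
    return x

x_T = np.linspace(-3.0, 3.0, 13)                  # a few fixed latents
exact = x_T * S0 / np.sqrt(variance(T))           # closed-form ODE solution
err_euler = np.abs(sample(x_T, euler_step, 50) - exact).max()
err_heun = np.abs(sample(x_T, heun_step, 50) - exact).max()
print(f"Euler error={err_euler:.4f}  Heun error={err_heun:.4f}")
```

At the same number of steps the second-order rule should land closer to the exact solution, at the cost of two drift evaluations per step instead of one. Choosing between them is a numerical decision, not a modelling one.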