Generative Modelling

Diffusion: Reverse Process

DDPM and DDIM — unstirring the coffee, one step at a time

01 · First principles: Why small steps are reversible

The forward process destroyed data in tiny Gaussian steps. We now want q(x_{t−1} | x_t), the distribution over "what the previous frame was". In general, reversing a stochastic process is as hard as knowing the data distribution itself. The whole construction was arranged so that one special case holds:

The key fact: when the forward steps are small and Gaussian, the true reverse conditionals are, to first order, also Gaussian — with a mean that is a small correction of x_t in the direction of the score ∇ log q(x_t). A Gaussian with a computable mean is something a network can fit.

So the entire problem of generation reduces to: estimate the score of the noised data at every noise level. Everything else is plumbing.
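As a worked form of the key fact, here is a sketch that assumes the usual variance-preserving forward step q(x_t | x_{t−1}) = N(√(1−β_t) x_{t−1}, β_t I); that convention is taken from DDPM rather than stated in this section. The mean below follows from Tweedie's formula; the Gaussian shape and the variance β_t are the small-step approximation.

```latex
% Reverse conditional for a small forward step of variance \beta_t
q(x_{t-1} \mid x_t) \;\approx\;
  \mathcal{N}\!\Big(
    \tfrac{1}{\sqrt{1-\beta_t}}\big(x_t + \beta_t \,\nabla_{x_t} \log q(x_t)\big),\;
    \beta_t I
  \Big)

% Expanding the mean to first order in \beta_t:
\mathbb{E}[x_{t-1} \mid x_t] \;\approx\;
  x_t + \tfrac{1}{2}\beta_t\, x_t + \beta_t\, \nabla_{x_t} \log q(x_t)
```

The only unknown on the right-hand side is the score, which is why estimating it at every noise level is enough.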

02 · The parameterisation: Predict the noise

Condition on x₀ and the reverse step has an exact closed form (Bayes' rule on three Gaussians): q(x_{t−1} | x_t, x₀) = N(μ̃(x_t, x₀), β̃_t I). At sampling time we lack x₀, but the forward closed form x_t = √ᾱ_t x₀ + √(1−ᾱ_t) ε (with α_t = 1 − β_t and ᾱ_t the product of α_1 through α_t) says that, given x_t, knowing the noise ε is the same as knowing x₀. So train a network ε_θ(x_t, t) to predict the noise that was added. The DDPM derivation starts from a VAE-style ELBO over the chain, but after simplification (dropping the per-step weights, which empirically improves sample quality) the loss is plain regression:

L_simple = E_{x₀, t, ε} [ ‖ ε − ε_θ(√ᾱ_t x₀ + √(1−ᾱ_t) ε, t) ‖² ]

(ε: the noise actually added; ε_θ(·, t): the network's guess)
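The loss above can be sketched in a few lines of NumPy. This is only an illustration of the computation, not a training loop: the linear β schedule, the function name `l_simple`, and both stand-in "models" (one that always predicts zero, one that is the exact optimum for toy Gaussian data) are assumptions introduced here. A real setup would use an autodiff framework and a neural network for ε_θ.

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)                            # assumed linear schedule
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - beta)])  # alpha_bar[t] for t = 0..T
DATA_STD = 0.5                                               # toy data: x0 ~ N(0, DATA_STD^2 I)

def l_simple(eps_model, x0, rng):
    """Monte Carlo estimate of L_simple on one batch (mean over all elements)."""
    t = rng.integers(1, T + 1, size=x0.shape[0])     # one timestep per example, uniform on 1..T
    eps = rng.standard_normal(x0.shape)              # the noise actually added
    a = alpha_bar[t][:, None]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # forward closed form
    return np.mean((eps - eps_model(x_t, t)) ** 2)   # squared error against the guess

def zero_model(x_t, t):
    """Hypothetical baseline that ignores its input."""
    return np.zeros_like(x_t)

def gaussian_optimal_model(x_t, t):
    """E[eps | x_t] in closed form, valid only for the toy Gaussian data above."""
    a = alpha_bar[t][:, None]
    return np.sqrt(1.0 - a) * x_t / (a * DATA_STD**2 + 1.0 - a)

rng = np.random.default_rng(0)
x0 = DATA_STD * rng.standard_normal((4096, 8))
loss_zero = l_simple(zero_model, x0, rng)
loss_opt = l_simple(gaussian_optimal_model, x0, rng)
```

The zero predictor pays roughly the variance of ε itself, while the optimal predictor pays less; training a network pushes its loss from the first value toward the second.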

No adversary, no encoder, no partition function — sample an image, a timestep, a noise vector; one forward pass; squared error. This boring loss is most of why diffusion training is so stable. And it is denoising score matching in disguise: the predicted noise gives the score directly,

∇_{x_t} log q(x_t) ≈ −ε_θ(x_t, t) / √(1−ᾱ_t)
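The identity can be checked on a case where the score is known exactly. The sketch below assumes one-dimensional toy data x₀ ~ N(0, 0.5²) and a single noise level ᾱ_t = 0.6 (both arbitrary choices made here), fits the best linear noise predictor by least squares, and converts it to a score with the formula above. The helper name `score_from_eps` is hypothetical.

```python
import numpy as np

def score_from_eps(eps_hat, alpha_bar_t):
    """Convert a noise prediction into a score estimate."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

DATA_STD = 0.5
alpha_bar_t = 0.6
var_t = alpha_bar_t * DATA_STD**2 + (1.0 - alpha_bar_t)  # variance of q(x_t) for this toy data

# Simulate the training pairs (x_t, eps) at this noise level.
rng = np.random.default_rng(0)
x0 = DATA_STD * rng.standard_normal(200_000)
eps = rng.standard_normal(200_000)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Least-squares fit of eps ~ w * x_t: the minimiser of the squared-error loss
# within linear predictors (which contains the true optimum for Gaussian data).
w = np.sum(x_t * eps) / np.sum(x_t**2)

grid = np.linspace(-2.0, 2.0, 9)
score_learned = score_from_eps(w * grid, alpha_bar_t)
score_exact = -grid / var_t   # score of N(0, var_t)
```

Here `score_learned` and `score_exact` agree up to Monte Carlo error, which is the sense in which noise regression is score matching.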

03 · Sampler one: DDPM, stochastic ancestral sampling

Start at x_T ~ N(0, I) and walk the chain backwards. At each step, use ε_θ to form the mean of the reverse Gaussian, then add fresh noise scaled by the step's variance:

x_{t−1} = (1/√α_t) ( x_t − (β_t/√(1−ᾱ_t)) · ε_θ(x_t, t) ) + σ_t z,   z ~ N(0, I)

(σ_t z: fresh noise, drawn anew at each step)
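The mean in this update is the closed-form posterior mean μ̃(x_t, x₀) with x₀ rewritten in terms of ε. The sketch below checks that numerically. The explicit expressions for μ̃ and β̃_t are the standard DDPM ones and are not spelled out in this text; the linear schedule, the timestep 500, and the function names are assumptions made for the example.

```python
import numpy as np

T = 1000
# Index 0 is padding so that beta[t], alpha[t], alpha_bar[t] match the math for t = 1..T.
beta = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)   # alpha_bar[0] = 1

def posterior(x_t, x0, t):
    """Mean and variance of q(x_{t-1} | x_t, x0) (standard DDPM closed form)."""
    coef_x0 = np.sqrt(alpha_bar[t - 1]) * beta[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alpha[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    var = beta[t] * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    return coef_x0 * x0 + coef_xt * x_t, var

def eps_form_mean(x_t, eps, t):
    """Mean of the reverse step written in terms of the noise, as in the update above."""
    return (x_t - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])

rng = np.random.default_rng(0)
t = 500
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

mean_post, var_post = posterior(x_t, x0, t)
mean_eps = eps_form_mean(x_t, eps, t)   # same numbers as mean_post
```

With the true ε the two means coincide exactly; at sampling time the network's ε_θ takes its place. Note that β̃_t is slightly smaller than β_t; both are used as choices for σ_t² in practice.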

This is Langevin-flavoured sampling on a ladder of noise levels: denoise a little, re-noise a little less. The re-injection of noise is what keeps the sampler honest about the remaining uncertainty at each level. The cost is the step count — the Gaussian approximation to the reverse step is only valid when steps are small, so DDPM needs the full T (originally 1000) network evaluations. One image, a thousand forward passes.
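A minimal sketch of the full ancestral loop follows. To keep it runnable without a trained network, `eps_model` is a stand-in: the exact optimal noise predictor for toy one-dimensional data x₀ ~ N(0, 0.5²). The linear schedule, the choice σ_t² = β_t, and the convention of adding no noise at the final step are assumptions made here (the last follows the usual DDPM sampling algorithm).

```python
import numpy as np

T = 1000
# Index 0 is padding so that beta[t], alpha[t], alpha_bar[t] match the math for t = 1..T.
beta = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)
DATA_STD = 0.5   # toy data: x0 ~ N(0, DATA_STD^2)

def eps_model(x_t, t):
    """Stand-in for a trained network: E[eps | x_t] for the toy Gaussian data."""
    var_t = alpha_bar[t] * DATA_STD**2 + (1.0 - alpha_bar[t])
    return np.sqrt(1.0 - alpha_bar[t]) * x_t / var_t

def ddpm_step(x_t, t, rng):
    """One reverse step: predicted mean plus fresh Gaussian noise."""
    eps_hat = eps_model(x_t, t)
    mean = (x_t - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha[t])
    if t == 1:
        return mean                      # no noise is added on the final step
    sigma_t = np.sqrt(beta[t])           # assumed choice: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def ddpm_sample(n, rng):
    """Draw n samples by walking the whole chain from t = T down to t = 1."""
    x = rng.standard_normal(n)           # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        x = ddpm_step(x, t, rng)
    return x

rng = np.random.default_rng(0)
samples = ddpm_sample(20_000, rng)       # one model call per timestep: T calls in total
```

The empirical mean and standard deviation of `samples` land close to those of the toy data, at the price of T evaluations of `eps_model` per sample.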

04 · Sampler two: DDIM, same network, deterministic path

DDIM's observation: the training loss above only ever uses the marginals q(x_t | x₀). Many different joint processes share those marginals, including non-Markovian ones in which x_{t−1} depends on both x_t and x₀ with zero injected noise. Pick that one, plug in the network's estimate x̂₀ = (x_t − √(1−ᾱ_t) ε_θ)/√ᾱ_t, and the update becomes deterministic:

x_{t−1} = √ᾱ_{t−1} · x̂₀ + √(1−ᾱ_{t−1}) · ε_θ(x_t, t)

(re-noise the current best guess of x₀ with the predicted noise, not a fresh draw)

Read the move: estimate the clean image, then place it at noise level t−1 using the predicted noise direction rather than a fresh draw. Nothing in the update requires the target level to be t−1: replacing ᾱ_{t−1} with ᾱ_s for any earlier timestep s < t gives a valid jump. Because no randomness enters, the trajectory is a smooth path that tolerates such big jumps, typically 20 to 50 steps instead of 1000, on the very same trained network. No retraining, only a different walk.
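The deterministic sampler can be sketched as follows, using a strided subsequence of timesteps so that each update jumps from t to some earlier s. As a stand-in for a trained network, `eps_model` is the exact optimal noise predictor for toy data x₀ ~ N(0, 0.5²); the linear schedule, the evenly spaced timestep grid, and the function names are assumptions made for the example.

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)                            # assumed linear schedule
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - beta)])  # alpha_bar[t] for t = 0..T
DATA_STD = 0.5                                               # toy data: x0 ~ N(0, DATA_STD^2)

def eps_model(x_t, t):
    """Stand-in for a trained network: E[eps | x_t] for the toy Gaussian data."""
    var_t = alpha_bar[t] * DATA_STD**2 + (1.0 - alpha_bar[t])
    return np.sqrt(1.0 - alpha_bar[t]) * x_t / var_t

def ddim_step(x_t, t, s):
    """Deterministic move from timestep t to an earlier timestep s < t."""
    eps_hat = eps_model(x_t, t)
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    # Re-noise the estimate of x0 to level s with the predicted noise, not a fresh draw.
    return np.sqrt(alpha_bar[s]) * x0_hat + np.sqrt(1.0 - alpha_bar[s]) * eps_hat

def ddim_sample(x_T, num_steps):
    """Walk from t = T to t = 0 in num_steps evenly spaced jumps."""
    timesteps = np.linspace(T, 0, num_steps + 1).round().astype(int)
    x = x_T
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_step(x, t, s)
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(20_000)
samples = ddim_sample(x_T, num_steps=50)   # 50 model calls instead of 1000
again = ddim_sample(x_T, num_steps=50)     # same x_T gives the same output
```

With 50 jumps the sample spread is close to the data's but slightly too narrow in this toy setting, which is the discretisation error that a coarser grid trades for speed.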

Determinism buys two further properties. The map from x_T to the image is now a function, so x_T acts as a true latent code (interpolate between two codes, get a semantic blend). And the map is approximately invertible: run the updates in reverse, from t = 0 up to T, to encode a real image into noise, with an error that shrinks as the steps get smaller. This is the basis of many diffusion-based image editing methods. DDIM is, in fact, a discretisation of the probability-flow ODE of the SDE view.
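Encoding by running the update in reverse can be sketched with the same rule applied in both directions. The example below encodes toy data points up to t = T, decodes them back, and measures the round-trip error for a coarse and a fine timestep grid. The analytic `eps_model` (optimal for x₀ ~ N(0, 0.5²)), the linear schedule, and all function names are assumptions made here, not part of any particular library.

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)                            # assumed linear schedule
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - beta)])  # alpha_bar[t] for t = 0..T
DATA_STD = 0.5                                               # toy data: x0 ~ N(0, DATA_STD^2)

def eps_model(x, t):
    """Stand-in for a trained network: E[eps | x_t] for the toy Gaussian data."""
    var_t = alpha_bar[t] * DATA_STD**2 + (1.0 - alpha_bar[t])
    return np.sqrt(1.0 - alpha_bar[t]) * x / var_t

def ddim_move(x, t_from, t_to):
    """DDIM update between two noise levels; t_to may be lower (decode) or higher (encode)."""
    eps_hat = eps_model(x, t_from)
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t_from]) * eps_hat) / np.sqrt(alpha_bar[t_from])
    return np.sqrt(alpha_bar[t_to]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_to]) * eps_hat

def run(x, timesteps):
    """Apply ddim_move along consecutive pairs of the given timestep sequence."""
    for t_from, t_to in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_move(x, t_from, t_to)
    return x

def round_trip_error(x0, num_steps):
    """Encode x0 to a latent at t = T, decode it again, return the relative error."""
    up = np.linspace(0, T, num_steps + 1).round().astype(int)
    latent = run(x0, up)           # image -> noise
    recon = run(latent, up[::-1])  # noise -> image
    return np.linalg.norm(recon - x0) / np.linalg.norm(x0)

x0 = DATA_STD * np.random.default_rng(0).standard_normal(5)
err_coarse = round_trip_error(x0, num_steps=50)
err_fine = round_trip_error(x0, num_steps=1000)
```

The reconstruction is not exact at finite step size, and the finer grid gives the smaller error, consistent with reading the update as a discretised ODE that is exactly invertible only in the small-step limit.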

	DDPM	DDIM
Reverse step	Gaussian, fresh noise each step	Deterministic (σ = 0)
Chain	Markovian, ancestral	Non-Markovian, same marginals
Steps needed	~1000	~20–50
x_T → image	One-to-many (stochastic)	Function; invertible; latent space
Sample diversity at fixed x_T	Yes, from injected noise	None — all variety comes from x_T
Retraining required	—	None; same ε_θ

05 · Visualize it: Two walks home

Same trained network, two samplers. DDPM staggers home, re-noising at every step; DDIM glides along a smooth deterministic trajectory it can traverse in large jumps.

Mental Model

Small Gaussian destruction has small Gaussian reversal; the reverse mean is "step along the score", and the score is all the network must learn.
Training is one line: add known noise, predict it, squared error. The stability of diffusion lives in the boringness of this loss.
ε-prediction, x₀-prediction, and score estimation are the same quantity in three coordinate systems.
DDPM walks back stochastically and needs every rung of the ladder; DDIM exploits "same marginals, different joint" to take a deterministic path in big jumps.
Determinism is a feature, not just a speedup: it gives a latent space and an invertible encoder for free.