DDPM and DDIM — unstirring the coffee, one step at a time
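The forward process destroyed data in tiny Gaussian steps. We now want q(xt−1|xt), the distribution over "what the previous frame was". In general, reversing a stochastic process is as hard as knowing the data distribution itself. The whole construction was arranged so that one special case holds: when each forward step is small enough, the reverse step is itself approximately Gaussian, with a mean fixed by the score ∇ log q(xt) of the noised data.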
So the entire problem of generation reduces to: estimate the score of the noised data at every noise level. Everything else is plumbing.
Condition on x0 and the reverse step has an exact closed form (Bayes on three Gaussians): q(xt−1|xt, x0) = N(μ̃(xt, x0), β̃tI). At sampling time we lack x0, but the forward closed form xt = √ᾱt x0 + √(1−ᾱt) ε says that, given xt, knowing the noise ε is the same as knowing x0. So train a network εθ(xt, t) to predict the noise that was added. The DDPM derivation starts from a VAE-style ELBO over the chain, but after simplification (dropping the per-step weights, which empirically helps) the loss is plain regression:
L_simple = E_{x0, t, ε} ‖ε − εθ(√ᾱt x0 + √(1−ᾱt) ε, t)‖²
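The regression described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original training code: the linear beta schedule, the helper names `make_alpha_bars` and `simple_loss`, and the `zero_model` placeholder standing in for a real network are all assumptions made for the example.

```python
import numpy as np

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One Monte Carlo estimate of the noise-prediction loss on a batch x0."""
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bars), size=batch)             # one timestep per sample (0-indexed)
    eps = rng.standard_normal(x0.shape)                          # the noise the model must recover
    a = alpha_bars[t].reshape((batch,) + (1,) * (x0.ndim - 1))   # broadcast over data dims
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps                # forward closed form
    return float(np.mean((eps - eps_model(xt, t)) ** 2))         # plain squared error

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
x0 = rng.standard_normal((8, 4))                # stand-in "images": 8 samples, 4 dims
zero_model = lambda xt, t: np.zeros_like(xt)    # hypothetical placeholder, not a real network
loss = simple_loss(zero_model, x0, alpha_bars, rng)
```

In a real setup `eps_model` would be a trained network and this scalar would be minimised by gradient descent; the sketch only shows that a single sampled (image, timestep, noise) triple and one forward pass are enough to evaluate it.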
No adversary, no encoder, no partition function: sample an image, a timestep, a noise vector; one forward pass; squared error. This boring loss is most of why diffusion training is so stable. And it is denoising score matching in disguise: the predicted noise gives the score directly,
∇xt log q(xt) ≈ −εθ(xt, t)/√(1−ᾱt)
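The link between predicted noise and score follows from the forward closed form in a few lines. The derivation below is a sketch that assumes εθ approximates the conditional mean E[ε | xt], which is what the minimiser of a squared-error loss estimates.

```latex
\begin{aligned}
q(x_t \mid x_0) &= \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right), \\
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
   = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}}, \\
\nabla_{x_t} \log q(x_t)
  &= \mathbb{E}\!\left[\nabla_{x_t} \log q(x_t \mid x_0) \,\middle|\, x_t\right]
   = -\frac{\mathbb{E}[\varepsilon \mid x_t]}{\sqrt{1-\bar\alpha_t}}
   \approx -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}.
\end{aligned}
```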
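Start at xT ~ N(0, I) and walk the chain backwards. At each step, use εθ to form the mean of the reverse Gaussian, then add fresh noise scaled by the step's standard deviation σt:
xt−1 = (1/√αt) (xt − (βt/√(1−ᾱt)) εθ(xt, t)) + σt z, with z ~ N(0, I), αt = 1 − βt, and σt² = βt or β̃t (no noise is added at the final step).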
This is Langevin-flavoured sampling on a ladder of noise levels: denoise a little, re-noise a little less. The re-injection of noise is what keeps the sampler honest about the remaining uncertainty at each level. The cost is the step count — the Gaussian approximation to the reverse step is only valid when steps are small, so DDPM needs the full T (originally 1000) network evaluations. One image, a thousand forward passes.
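The ancestral sampler described above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the variance choice σt² = βt, the 0-indexed timesteps, the function names, and the analytic `toy_eps` (the exact noise predictor when every data point equals the constant `c`) are all chosen for the example.

```python
import numpy as np

def ddpm_step(eps_model, xt, t, betas, alpha_bars, rng):
    """One ancestral step from level t to t-1 (t is a 0-indexed int)."""
    beta_t = betas[t]
    eps = eps_model(xt, t)
    # Mean of the reverse Gaussian, built from the predicted noise
    mean = (xt - beta_t / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(1.0 - beta_t)
    if t == 0:
        return mean                                   # last step: return the mean, no noise
    sigma_t = np.sqrt(beta_t)                         # assumed choice sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(xt.shape)

def ddpm_sample(eps_model, shape, betas, rng):
    """Start from pure noise and walk every level of the chain backwards."""
    alpha_bars = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        x = ddpm_step(eps_model, x, t, betas, alpha_bars, rng)
    return x

# Toy check: if all data equals c, the exact noise predictor is known in closed form.
betas = np.linspace(1e-4, 0.02, 200)
alpha_bars = np.cumprod(1.0 - betas)
c = 3.0
toy_eps = lambda x, t: (x - np.sqrt(alpha_bars[t]) * c) / np.sqrt(1.0 - alpha_bars[t])
samples = ddpm_sample(toy_eps, (5, 2), betas, np.random.default_rng(0))
```

Note that the loop calls `eps_model` once per level, which is exactly the step-count cost discussed above.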
DDIM's observation: the training loss above only ever uses the marginals q(xt|x0). Many different joint processes share those marginals, including non-Markovian ones in which xt−1 depends on both xt and x0 with zero injected noise. Pick that one, plug in the network's estimate x̂0 = (xt − √(1−ᾱt) εθ)/√ᾱt, and the update becomes deterministic:
xt−1 = √ᾱt−1 x̂0 + √(1−ᾱt−1) εθ(xt, t)
Read the move: estimate the clean image, then place it at noise level t−1 using the predicted noise direction rather than a fresh draw. Because no randomness enters, the trajectory is a smooth path that tolerates big jumps — 20 to 50 steps instead of 1000 — on the very same trained network. No retraining, only a different walk.
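A minimal sketch of this deterministic sampler, taking large jumps over a subsequence of timesteps. The evenly spaced subsequence, the convention that `t_prev = -1` means "fully clean" (ᾱ = 1), the function names, and the analytic `toy_eps` for constant data `c` are assumptions made for the illustration.

```python
import numpy as np

def ddim_step(eps_model, xt, t, t_prev, alpha_bars):
    """Deterministic DDIM update from level t to t_prev (t_prev = -1 means clean)."""
    a_t = alpha_bars[t]
    a_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    eps = eps_model(xt, t)
    x0_hat = (xt - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)         # estimate the clean image
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps   # re-place it at level t_prev

def ddim_sample(eps_model, x_T, alpha_bars, num_steps=20):
    """Walk an evenly spaced, descending subsequence of the T training levels."""
    T = len(alpha_bars)
    ts = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(ts):
        t_prev = ts[i + 1] if i + 1 < len(ts) else -1
        x = ddim_step(eps_model, x, t, t_prev, alpha_bars)
    return x

# Toy check with constant data c, for which the exact noise predictor is analytic.
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
c = 3.0
toy_eps = lambda x, t: (x - np.sqrt(alpha_bars[t]) * c) / np.sqrt(1.0 - alpha_bars[t])
x_T = np.random.default_rng(0).standard_normal((5, 2))
out = ddim_sample(toy_eps, x_T, alpha_bars, num_steps=20)
```

The same `eps_model` that a DDPM sampler would call 1000 times is called only `num_steps` times here, and no random number generator appears anywhere in the update.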
Determinism buys two further properties. The map from xT to the image is now a function, so xT acts as a true latent code (interpolate between two codes, get a semantic blend). And the map is invertible, exactly in the continuous-time limit and approximately at finite step size: run the updates in reverse to encode a real image into noise, the basis of most diffusion-based image editing. DDIM is, in fact, a discretisation of the probability-flow ODE of the SDE view.
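Running the updates in reverse can be sketched as below. This is a hedged illustration: it treats the input image as sitting at the lowest noise level t = 0, reuses the noise predicted at the current level when stepping upward (so the round trip is only approximate at finite step size), and uses an analytic `toy_eps` that is the optimal predictor for data drawn from N(0, s²I). All function names are hypothetical.

```python
import numpy as np

def ddim_move(eps_model, x, t_from, t_to, alpha_bars):
    """Deterministic DDIM move of x from noise level t_from to level t_to.

    t_to < t_from denoises (sampling); t_to > t_from adds noise (encoding).
    """
    a_from, a_to = alpha_bars[t_from], alpha_bars[t_to]
    eps = eps_model(x, t_from)
    x0_hat = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_hat + np.sqrt(1.0 - a_to) * eps

def ddim_encode(eps_model, image, alpha_bars):
    """Image -> latent. Assumption: the image is treated as level t = 0."""
    x = image
    for t in range(len(alpha_bars) - 1):
        x = ddim_move(eps_model, x, t, t + 1, alpha_bars)
    return x

def ddim_decode(eps_model, latent, alpha_bars):
    """Latent -> image, stepping back down through every level."""
    x = latent
    for t in range(len(alpha_bars) - 1, 0, -1):
        x = ddim_move(eps_model, x, t, t - 1, alpha_bars)
    return x

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
s = 2.0
# Analytic optimal noise predictor when the data is N(0, s^2 I)
toy_eps = lambda x, t: (np.sqrt(1.0 - alpha_bars[t]) * x
                        / (alpha_bars[t] * s**2 + 1.0 - alpha_bars[t]))
image = s * np.random.default_rng(0).standard_normal((4, 3))
latent = ddim_encode(toy_eps, image, alpha_bars)
recon = ddim_decode(toy_eps, latent, alpha_bars)   # close to `image`, not bit-exact
```

With small steps the reconstruction tracks the input closely; taking fewer, larger steps in both directions would widen the gap, which is the usual trade-off when inversion is used for editing.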
| | DDPM | DDIM |
|---|---|---|
| Reverse step | Gaussian, fresh noise each step | Deterministic (σ = 0) |
| Chain | Markovian, ancestral | Non-Markovian, same marginals |
| Steps needed | ~1000 | ~20–50 |
| xT → image | One-to-many (stochastic) | Function; invertible; latent space |
| Sample diversity at fixed xT | Yes, from injected noise | None — all variety comes from xT |
| Retraining required | — | None; same εθ |
Same trained network, two samplers. DDPM staggers home, re-noising at every step; DDIM glides along a smooth deterministic trajectory it can traverse in large jumps.