Destroying data carefully, so that creation can be learned
Generation in one shot, mapping pure noise to a finished image in a single step, asks one network to invent everything at once: layout, objects, texture, lighting. GANs attempt exactly this and pay for it in training instability; VAEs pay in blur. The transformation from noise to data is simply too violent to learn as one move.
Diffusion's idea is to never ask for the violent move. Instead, define a forward process that destroys data in many tiny steps, each adding a small amount of Gaussian noise. Each tiny destruction is trivially simple — and, crucially, a tiny destruction is only slightly ambiguous to undo. Then train a network to undo one tiny step at a time (the reverse process).
The analogy is stirring milk into coffee. One hundred small stirs take the swirl to uniform beige gradually; between any two consecutive frames, the change is small enough that you could plausibly say what the previous frame looked like. Asked to reconstruct the original swirl from the final beige in one step, you could not. Asked to reverse one small stir, you nearly can — and "nearly, with small Gaussian uncertainty" is exactly what a network can learn.
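Fix a variance schedule $\beta_1, \dots, \beta_T$ (small numbers, e.g. $10^{-4}$ to $0.02$). Each step shrinks the signal slightly and tops it up with fresh noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad \text{i.e.} \qquad x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, I).$$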
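The shrink factor $\sqrt{1-\beta_t}$ matters: it keeps the total variance bounded, so the process converges to $\mathcal{N}(0, I)$ instead of exploding. Now the key engineering fact. Because a Gaussian convolved with a Gaussian is Gaussian, the whole chain composes in closed form. Writing $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s \le t} \alpha_s$:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right), \qquad \text{i.e.} \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I).$$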
This single line is why diffusion trains cheaply. To get a training example at noise level t, we do not simulate t steps of stirring — we jump straight there with one sample of ε. Every minibatch can hit a random t at the cost of one multiply-add. Without this closed form, training would require running the chain, and the method would be impractical.
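As a minimal sketch of that jump, the snippet below draws a sample at a random noise level directly from a clean input. It assumes a linear schedule from 1e-4 to 0.02 over T = 1000 steps, and the helper names (`linear_betas`, `alpha_bars`, `q_sample`) and the toy three-value "image" are illustrative choices, not part of any particular library.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variance schedule beta_1..beta_T (endpoints are assumptions)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t from q(x_t | x_0) in one shot; t is 1-indexed."""
    a = abar[t - 1]
    scale, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]          # one sample of epsilon
    xt = [scale * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
abar = alpha_bars(linear_betas())
x0 = [1.0, -1.0, 0.5]            # toy stand-in for an image
t = rng.randint(1, len(abar))    # random noise level, as in a training minibatch
xt, eps = q_sample(x0, t, abar, rng)
print(t, xt)
```

The cumulative products are computed once up front, so each training example costs one table lookup, one noise draw, and one multiply-add per element, independent of `t`.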
As $t \to T$, $\bar\alpha_t \to 0$ and $q(x_T \mid x_0) \to \mathcal{N}(0, I)$ regardless of $x_0$: every image is stirred into the same beige. That shared endpoint is what the sampler will start from.
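Both claims can be checked numerically: that the step-by-step chain composes into the closed form, and that almost no signal survives at the last step. The sketch below tracks only the coefficient on the clean input and the accumulated noise variance, again assuming a linear schedule from 1e-4 to 0.02 with T = 1000; the variable names are illustrative.

```python
import math

T = 1000
# Assumed linear schedule beta_1..beta_T from 1e-4 to 0.02.
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

mean_coef = 1.0   # coefficient multiplying x_0 (starts at 1: clean data)
var = 0.0         # accumulated noise variance (starts at 0: no noise)
abar = 1.0        # running product of (1 - beta_s)

for b in betas:
    mean_coef *= math.sqrt(1.0 - b)   # the signal shrinks a little
    var = (1.0 - b) * var + b         # old noise shrinks, fresh noise is added
    abar *= 1.0 - b

# Step-by-step moments agree with the closed form sqrt(abar) and 1 - abar.
assert math.isclose(mean_coef, math.sqrt(abar), rel_tol=1e-9)
assert math.isclose(var, 1.0 - abar, rel_tol=1e-9)

print("signal coefficient at T:", mean_coef)   # well under 0.01 for this schedule
print("noise variance at T:", var)             # just below 1
```

The variance recursion also shows why the shrink factor matters: without the `(1.0 - b)` factor, `var` would grow by `b` at every step instead of settling just below 1.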
Figure: A bimodal data distribution under increasing noise. The modes broaden, merge, and converge to the same standard Gaussian (dashed). By construction, all structure is gone by $t = T$.
The schedule $\{\beta_t\}$ is best read through one summary number, the signal-to-noise ratio at step $t$:

$$\mathrm{SNR}(t) = \frac{\bar\alpha_t}{1-\bar\alpha_t}.$$
SNR must travel from very large (clean data) to near zero (pure noise); the schedule decides how it spends time along the way. That allocation is a curriculum: each noise level the network sees is a different task — high SNR steps teach fine texture repair, low SNR steps teach global layout from almost nothing.
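To make the "how it spends time" idea concrete, the sketch below computes the SNR, taken as the ratio of signal variance to noise variance (cumulative product of 1 − β over one minus that product), for an assumed linear schedule and counts how many steps fall into rough high, middle, and low regimes. The thresholds of 10 and 0.1 and the function names are arbitrary illustrative choices.

```python
import math

def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule, t = 1..T."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + i * (beta_end - beta_start) / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def snr(abar):
    """Signal-to-noise ratio: signal variance over noise variance."""
    return abar / (1.0 - abar)

abars = linear_alpha_bars()
log_snr = [math.log(snr(a)) for a in abars]

# First step (1-indexed) at which noise outweighs signal, i.e. SNR < 1.
crossover = next(t for t, v in enumerate(log_snr, start=1) if v < 0.0)

# Rough curriculum buckets (thresholds are arbitrary, for illustration only).
high = sum(v > math.log(10.0) for v in log_snr)   # texture-repair regime
low = sum(v < math.log(0.1) for v in log_snr)     # layout-from-noise regime
mid = len(log_snr) - high - low

print("SNR drops below 1 at step", crossover)
print("steps per regime (high, mid, low):", high, mid, low)
```

Reading a schedule this way turns a list of β values into a statement about how many training tasks of each kind the network is given.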
| Schedule | Behaviour | Caveat / note |
|---|---|---|
| Linear β (original DDPM) | Simple; destroys signal aggressively early on. | Wastes many late steps at SNR ≈ 0, where there is nothing left to learn. |
| Cosine | $\bar\alpha_t$ follows a cosine; SNR decays smoothly, mid-range levels get more time. | The common default for pixel-space models. |
| Shifted / resolution-aware | Shift SNR lower for high-resolution images, where redundancy among pixels makes a given noise level effectively easier. | The "right" schedule depends on data dimensionality — there is no universal one. |
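The first two rows can be compared directly. The sketch below measures what fraction of steps each schedule spends at very low SNR, assuming T = 1000, a linear schedule from 1e-4 to 0.02, and the cosine form with offset s = 0.008 from the improved-DDPM formulation. The SNR floor of 0.01 is an arbitrary stand-in for "SNR ≈ 0", and the function names are illustrative.

```python
import math

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule, t = 1..T."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + i * (beta_end - beta_start) / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    f0 = f(0)
    return [f(t) / f0 for t in range(1, T + 1)]

def fraction_below(abars, snr_floor):
    """Fraction of steps whose SNR = abar / (1 - abar) is under snr_floor."""
    return sum(a / (1.0 - a) < snr_floor for a in abars) / len(abars)

T = 1000
floor = 0.01   # arbitrary threshold standing in for "almost pure noise"
frac_linear = fraction_below(linear_alpha_bars(T), floor)
frac_cosine = fraction_below(cosine_alpha_bars(T), floor)

print("fraction of steps with SNR < 0.01, linear:", frac_linear)
print("fraction of steps with SNR < 0.01, cosine:", frac_cosine)
```

Under these assumptions the linear schedule spends a several times larger share of its steps below the floor than the cosine schedule does, which is the waste the table's first row describes.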