Generative Modelling

Diffusion: Forward Process

Destroying data carefully, so that creation can be learned

01 · First principles: Why destroy the data at all?

Generation in one shot — map pure noise to a finished image in a single step — asks one network to invent everything at once: layout, objects, texture, lighting. GANs attempt exactly this and pay for it in training instability; VAEs pay in blur. The transformation from noise to data is simply too violent to learn as one move.

Diffusion's idea is to never ask for the violent move. Instead, define a forward process that destroys data in many tiny steps, each adding a small amount of Gaussian noise. Each tiny destruction is trivially simple — and, crucially, a tiny destruction is only slightly ambiguous to undo. Then train a network to undo one tiny step at a time (the reverse process).

The analogy is stirring milk into coffee. One hundred small stirs take the swirl to uniform beige gradually; between any two consecutive frames, the change is small enough that you could plausibly say what the previous frame looked like. Asked to reconstruct the original swirl from the final beige in one step, you could not. Asked to reverse one small stir, you nearly can — and "nearly, with small Gaussian uncertainty" is exactly what a network can learn.

The design principle: we choose the destruction precisely so that its reversal decomposes into many easy, locally-Gaussian problems.

02 · The process: Small Gaussian steps, and the closed form

Fix a variance schedule β₁, …, β_T (small numbers, e.g. 10⁻⁴ to 0.02). Each step shrinks the signal slightly and tops it up with fresh noise:

q(x_t | x_{t−1}) = N( √(1−β_t) · x_{t−1}, β_t I )
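The single step above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `forward_step` is a hypothetical helper name, the "data" is a stand-in array with unit variance, and β = 0.02 is taken from the top of the range quoted above.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x_prev = rng.standard_normal(200_000)   # stand-in "data" with unit variance
x_t = forward_step(x_prev, beta_t=0.02, rng=rng)

# For unit-variance input the variance is preserved:
# (1 - beta) * 1 + beta = 1, so the step neither shrinks nor inflates the scale.
var_before = x_prev.var()
var_after = x_t.var()
```

If the input variance were larger than 1, the same step would pull it towards 1 rather than letting it grow, which is the role of the √(1−β_t) factor.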

The shrink factor √(1−β_t) matters: it keeps the total variance bounded, so the process converges to N(0, I) instead of exploding. Now the key engineering fact. Each step is a linear rescaling plus independent Gaussian noise, and a sum of independent Gaussians is itself Gaussian, so the whole chain composes in closed form. Writing α_t = 1−β_t and ᾱ_t = ∏_{s≤t} α_s:

q(x_t | x₀) = N( √ᾱ_t · x₀, (1−ᾱ_t) I )  ⇔  x_t = √ᾱ_t x₀ + √(1−ᾱ_t) ε,  ε ~ N(0, I)

Here √ᾱ_t x₀ is the surviving signal and √(1−ᾱ_t) ε is the accumulated noise.

This single line is why diffusion trains cheaply. To get a training example at noise level t, we do not simulate t steps of stirring; we jump straight there with one sample of ε. Every minibatch can hit a random t at the cost of one multiply-add per element. Without this closed form, every training example would require simulating up to T sequential noising steps, and the method would be impractical.
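As a sanity check on the closed form, the sketch below compares the one-shot jump against actually running the chain. It assumes T = 1000 steps and a linear schedule over the range quoted earlier; `q_sample` and `q_sample_by_chain` are illustrative names, not any library's API.

```python
import numpy as np

T = 1000                                  # assumed number of steps
betas = np.linspace(1e-4, 0.02, T)        # linear schedule over the quoted range
alpha_bars = np.cumprod(1.0 - betas)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Jump to step t (1-indexed): x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    abar = alpha_bars[t - 1]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

def q_sample_by_chain(x0, t, rng):
    """Slow reference: apply the first t single steps one after another."""
    x = x0.copy()
    for beta in betas[:t]:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.full(100_000, 2.0)                # many copies of one "data point"
t = 300
x_jump, eps = q_sample(x0, t, rng)        # one draw of eps
x_chain = q_sample_by_chain(x0, t, rng)   # 300 sequential draws

# Both routes should produce the same distribution: compare mean and variance.
stats_jump = (x_jump.mean(), x_jump.var())
stats_chain = (x_chain.mean(), x_chain.var())
```

Returning `eps` alongside `x_t` is a deliberate choice in this sketch: a denoising network is commonly trained to predict that noise, so a training loop would need both.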

As t → T, ᾱ_t shrinks towards 0. For a well-chosen schedule ᾱ_T ≈ 0, so q(x_T | x₀) ≈ N(0, I) regardless of x₀: every image is stirred into the same beige. That shared endpoint is what the sampler will start from.
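How close the endpoint gets to pure noise depends on the schedule. A quick check, assuming T = 1000 and the linear range quoted earlier (both are assumptions of this sketch), shows that ᾱ_T is tiny but not exactly zero, on the order of 10⁻⁵:

```python
import numpy as np

T = 1000                                  # assumed number of steps
betas = np.linspace(1e-4, 0.02, T)        # linear schedule over the quoted range
alpha_bar_T = np.prod(1.0 - betas)        # fraction of signal variance left at t = T

signal_scale = np.sqrt(alpha_bar_T)       # factor multiplying x0 in x_T
noise_var = 1.0 - alpha_bar_T             # variance of the noise term in x_T
```

The residual signal scale is small enough that starting the sampler from N(0, I) is a close approximation rather than an exact match.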

03 · Visualize it: A 1-D distribution dissolving

A bimodal data distribution under increasing noise: modes broaden, merge, and converge to the same standard Gaussian (dashed). All structure is gone by t = T — by construction.
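The picture described above can be reproduced numerically. The sketch below uses a hypothetical bimodal dataset (an equal mixture of two narrow Gaussians at ±2, chosen only for illustration) and noises it at a few values of ᾱ using the closed form; the histograms can be passed to any plotting tool.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical bimodal data: equal mixture of N(-2, 0.3^2) and N(+2, 0.3^2).
centres = rng.choice([-2.0, 2.0], size=n)
x0 = centres + 0.3 * rng.standard_normal(n)

def noised(x0, alpha_bar, rng):
    """Sample x_t from q(x_t | x0) at the noise level given by alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Each mode's centre shrinks to +/- 2 * sqrt(alpha_bar) while its width grows
# towards 1, so the two bumps broaden, merge, and end as one standard Gaussian.
summary = {}
for alpha_bar in (1.0, 0.5, 0.1, 0.0):
    xt = noised(x0, alpha_bar, rng)
    hist, _ = np.histogram(xt, bins=60, range=(-4, 4), density=True)
    summary[alpha_bar] = (xt.mean(), xt.var(), hist)
```

At ᾱ = 1 the histogram has an empty gap around zero; at ᾱ = 0 the gap has filled in and the variance is 1.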

04 · The knob: Schedules and signal-to-noise ratio

The schedule {β_t} is best read through one summary number, the signal-to-noise ratio at step t:

SNR(t) = ᾱ_t / (1 − ᾱ_t)

SNR must travel from very large (clean data) to near zero (pure noise); the schedule decides how it spends time along the way. That allocation is a curriculum: each noise level the network sees is a different task. High-SNR steps teach fine texture repair; low-SNR steps teach global layout from almost nothing.
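The SNR curve of a given schedule is a one-liner to compute. The sketch below again assumes T = 1000 and the linear β range quoted earlier; `snr` and `alpha_bar_from_snr` are illustrative helper names.

```python
import numpy as np

def snr(alpha_bar):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t)."""
    alpha_bar = np.asarray(alpha_bar, dtype=float)
    return alpha_bar / (1.0 - alpha_bar)

def alpha_bar_from_snr(s):
    """Inverse map: alpha_bar = SNR / (1 + SNR)."""
    s = np.asarray(s, dtype=float)
    return s / (1.0 + s)

T = 1000                                  # assumed number of steps
betas = np.linspace(1e-4, 0.02, T)        # linear schedule over the quoted range
alpha_bars = np.cumprod(1.0 - betas)

snr_t = snr(alpha_bars)                   # decreases monotonically with t
log_snr_t = np.log(snr_t)                 # often the more readable axis

# First step (1-indexed) at which noise outweighs signal, i.e. SNR < 1.
crossover = int(np.argmax(snr_t < 1.0)) + 1
```

Because SNR and ᾱ determine each other, a schedule can be specified in either form; working in log-SNR tends to make the very high and very low ends of the range easier to inspect.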

Schedule	Behaviour	Notes
Linear β (original DDPM)	Simple; destroys signal aggressively early on.	Wastes many late steps at SNR ≈ 0, where there is nothing left to learn.
Cosine	ᾱ_t follows a cosine; SNR decays smoothly, mid-range levels get more time.	A common default for pixel-space models.
Shifted / resolution-aware	Shift SNR lower for high-resolution images, where redundancy among pixels makes a given noise level effectively easier.	The "right" schedule depends on data dimensionality — there is no universal one.
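The first two rows of the table can be compared directly. This sketch assumes T = 1000, the linear β range quoted earlier, and the usual cosine form ᾱ_t = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) · π/2) and offset s = 0.008. The SNR threshold of 10⁻² used to call a step "nearly dead" is an arbitrary choice for illustration, and practical implementations typically also derive clipped β_t values from ᾱ_t, which is omitted here.

```python
import numpy as np

T = 1000                                  # assumed number of steps
t = np.arange(1, T + 1)

# Linear beta schedule over the range quoted in the text.
abar_linear = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

# Cosine schedule defined directly on alpha_bar.
s = 0.008                                 # small offset so early steps are not too tiny

def f(u):
    return np.cos((u + s) / (1.0 + s) * np.pi / 2.0) ** 2

abar_cosine = f(t / T) / f(0.0)

def steps_below(alpha_bar, snr_threshold):
    """Count steps whose SNR = alpha_bar / (1 - alpha_bar) is under the threshold."""
    return int(np.sum(alpha_bar / (1.0 - alpha_bar) < snr_threshold))

near_dead_linear = steps_below(abar_linear, 1e-2)
near_dead_cosine = steps_below(abar_cosine, 1e-2)

# Signal remaining at the halfway point of each schedule.
mid_linear = abar_linear[T // 2 - 1]
mid_cosine = abar_cosine[T // 2 - 1]
```

In this setup the linear schedule has less signal left at the halfway point and spends more of its steps below the threshold than the cosine schedule does, which is the behaviour the table describes.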

One subtlety worth keeping in mind: the forward process has no learned parameters. Everything trainable lives in the reverse process; the forward process is pure scaffolding, chosen once, that defines the family of denoising tasks. Its continuous-time limit is taken up in diffusion as SDEs.

Mental Model

One impossible problem (noise → data) is exchanged for a thousand easy ones (slightly noisy → slightly less noisy).
Milk into coffee: many small stirs, each one nearly reversible; the full stir is not.
x_t = √ᾱ_t x₀ + √(1−ᾱ_t) ε — teleport to any noise level in one line; this closed form is what makes training affordable.
The schedule is a curriculum over SNR: where it lingers is where the model gets the most practice.
The forward process learns nothing; it exists only to define the tasks the reverse process will solve.