Generative Modelling

Classifier-Free Guidance

Exaggerate the difference the prompt makes

01 · First principles · Why conditional samples come out bland

Train a conditional diffusion model p(x|y) on image–caption pairs and sample with the caption "a corgi in a spacesuit". The result is usually plausible — and weakly conditioned: dog-ish, vaguely costume-ish, generic. Why?

Because maximum likelihood is honest. The model learns the full conditional distribution, including its enormous fuzzy margins: images that are only somewhat corgi, scenes where the caption fits loosely. Sampling draws from all of that mass. Nothing in the training objective rewards being emphatically on-prompt; the condition enters as one input among many, and the denoiser can do most of its loss reduction from the noisy image alone, leaning on y only where it must. The condition is a whisper, and we would like a dial to turn it into a command.

02 · First attempt · Classifier guidance, and why it hurts

Bayes suggests the dial. Since p(x|y) ∝ p(x) p(y|x), the conditional score splits, and we can overweight the classifier term with a scale s > 1:

∇_x log p̃(x|y) = ∇_x log p(x) + s · ∇_x log p(y|x)
push toward regions a classifier labels as y — amplified

This works (it powered the "diffusion beats GANs" result), but the cost is steep. The gradient must be evaluated at every noise level t, so an ordinary classifier is useless — we must train a separate classifier on noised images at all noise scales, per dataset. Worse, classifier gradients at high noise are notoriously erratic (the classifier confidently labels static), and adversarial-style gradients can satisfy the classifier without satisfying a human. An extra model, extra training, fragile gradients: painful.
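To make the scaled-classifier update concrete, here is a minimal sketch on a toy problem where every term is analytic. It assumes a 1-D prior that is an equal mixture of N(−2, 1) and N(+2, 1), with y naming the right-hand component, so that p(y|x) = sigmoid(4x). The mixture, the function names and the single noise-free level are illustrative assumptions; a real system needs a classifier trained on noised images at every t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prior_score(x):
    """Score of the assumed prior p(x) = 0.5*N(-2, 1) + 0.5*N(+2, 1)."""
    return -x + 4.0 * sigmoid(4.0 * x) - 2.0

def classifier_score(x):
    """Gradient of log p(y|x) for y = right-hand component, p(y|x) = sigmoid(4x)."""
    return 4.0 * (1.0 - sigmoid(4.0 * x))

def classifier_guided_score(x, s):
    """Prior score plus the classifier gradient, overweighted by s."""
    return prior_score(x) + s * classifier_score(x)

x = np.linspace(-3.0, 3.0, 7)
cond = classifier_guided_score(x, 1.0)     # s = 1: score of p(x|y) = N(+2, 1)
boosted = classifier_guided_score(x, 3.0)  # s = 3: extra push toward class y
```

At s = 1 the result coincides with the exact conditional score −(x − 2); at s = 3 the push toward the labelled component is stronger everywhere, and most of the extra push lands where the classifier is still unsure (x < 0).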

03 · The trick · Make the model its own classifier

Rearrange the same Bayes identity: ∇ log p(y|x) = ∇ log p(x|y) − ∇ log p(x). The classifier gradient is just the difference between the conditional and unconditional scores, two things one diffusion model can supply if we train it to do both jobs. The recipe is almost embarrassingly cheap: during training, randomly replace the condition with a null token ∅ about 10% of the time. One network now computes ε̂(x_t, y) and ε̂(x_t, ∅). At sampling time, extrapolate:

ε̂_guided = ε̂(x_t, ∅) + s · ( ε̂(x_t, y) − ε̂(x_t, ∅) )
the direction "what the prompt changes", scaled by s

At s = 1 this is the ordinary conditional model. At s > 1 (typically 5–10) we step past the conditional prediction, along the line from unconditional to conditional — exaggerating exactly the component of the denoising direction that the prompt is responsible for.
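The extrapolation is a one-liner in code. The sketch below uses random arrays as stand-ins for the two network outputs; the function name cfg_combine and the array shapes are assumptions for illustration, not any particular library's API.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, s):
    """Start at the unconditional prediction and walk s times the (cond - uncond) gap."""
    return eps_uncond + s * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((2, 4))  # stand-in for eps_hat(x_t, null)
eps_cond = rng.standard_normal((2, 4))    # stand-in for eps_hat(x_t, y)

at_zero = cfg_combine(eps_uncond, eps_cond, 0.0)   # purely unconditional
at_one = cfg_combine(eps_uncond, eps_cond, 1.0)    # ordinary conditional model
at_seven = cfg_combine(eps_uncond, eps_cond, 7.0)  # six gap-lengths past the conditional
```

Both predictions come from the same network, so a common design choice is to stack the conditioned and null-conditioned inputs into one batch and pay for a single, larger forward pass per sampling step.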

[Figure: from x_t, arrows for ε̂(x_t, ∅) and ε̂(x_t, y), their difference, and the guided prediction for s > 1. Start at ∅, walk s times the (cond − uncond) gap, past the conditional arrow.]

Guidance is linear extrapolation in prediction space: keep going in the direction the prompt pulled you, beyond where the model would have stopped.
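The training half of the recipe, dropping the condition at random, is just as short. This is a minimal sketch for integer class labels; the reserved id NULL_TOKEN, the function name and the default p_uncond = 0.1 are assumptions for illustration (for text prompts the null condition is typically an empty caption instead).

```python
import numpy as np

NULL_TOKEN = -1  # hypothetical id reserved for the null condition

def drop_conditions(labels, p_uncond=0.1, rng=None):
    """Replace each label with NULL_TOKEN independently with probability p_uncond."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    drop = rng.random(labels.shape) < p_uncond
    return np.where(drop, NULL_TOKEN, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10_000)
train_labels = drop_conditions(labels, p_uncond=0.1, rng=rng)
dropped_fraction = np.mean(train_labels == NULL_TOKEN)  # close to 0.1
```

The denoising loss itself is unchanged; the network simply sees the null token often enough to learn the unconditional prediction alongside the conditional one.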

04 · What it means · The Bayes view: a sharpened posterior

Substituting scores for ε̂ (they differ by a factor of −σ_t), the guided field is the score of an unnormalised distribution:

p̃(x|y) ∝ p(x) · p(y|x)^s ∝ p(x|y) · p(y|x)^(s−1)
the implicit classifier, raised to the power s

We are not sampling the true conditional. We are sampling a sharpened one, reweighted toward points where the condition is unambiguously recognisable. Raising a likelihood to a power s > 1 concentrates mass on its confident region and starves the fuzzy margins, which are precisely the bland mass we wanted to suppress. (Strictly, at each noise level t the guided field is the score of a sharpened density built from that level's noised marginals, but these densities are not the noised versions of one sharpened data distribution, so the sampler is not an exact reverse process for p̃. Guidance is an extremely useful heuristic that the Bayes view explains rather than licenses.)
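A Gaussian toy case makes the sharpening visible in closed form. The sketch assumes a prior p(x) = N(0, 1) and a conditional p(x|y) = N(m, v), and uses p(y|x) ∝ p(x|y) / p(x), so that p̃ ∝ p(x)^(1−s) · p(x|y)^s is again Gaussian. The function name and the chosen numbers are illustrative assumptions, and the calculation is for clean data only (no noise levels).

```python
def sharpened_gaussian(m, v, s):
    """Mean and variance of p(x)^(1-s) * p(x|y)^s with p(x)=N(0,1), p(x|y)=N(m,v)."""
    # Exponents add in natural parameters: precisions and precision-weighted means.
    precision = (1.0 - s) * 1.0 + s / v
    if precision <= 0:
        raise ValueError("p-tilde is not normalisable for this (v, s)")
    mean = (s * m / v) / precision
    return mean, 1.0 / precision

for s in (1.0, 3.0, 7.0):
    mean, var = sharpened_gaussian(m=1.0, v=0.5, s=s)
    print(f"s={s}: mean={mean:.3f}, var={var:.3f}")
```

With m = 1 and v = 0.5 this prints mean 1.000 and variance 0.500 at s = 1, mean 1.500 and variance 0.250 at s = 3, and mean 1.750 and variance 0.125 at s = 7: the mode moves further from the prior than the true conditional does, and the spread collapses.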

05 · The cost · The dial trades diversity for adherence

Scale s   Samples                        Failure mode
1         Faithful to p(x|y), diverse    Weak prompt adherence, generic outputs
3–10      Sharp, on-prompt, photogenic   Diversity visibly narrows; samples cluster on archetypes
15+       Caricature of the prompt       Oversaturated colours, burned contrast, artefacts

The failure at high s has a mechanical reading: extrapolation can push x_t off the data manifold into regions the denoiser never saw, and the per-pixel overshoot accumulates as the blown-out, oversaturated look. Practical mitigations exist (clipping or rescaling the guided prediction, decaying s across timesteps, applying guidance only in a middle interval of the trajectory), all of them ways of taking the extrapolation seriously where it helps and reining it in where the linear approximation stops being trustworthy. The same trick transplants directly to flow matching (guide the velocity field) and to autoregressive models (logit extrapolation): wherever a model can be run with and without its condition, the gap between the two runs is a steerable direction.
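Two of the mitigations named above can be sketched in a few lines. Everything specific here is an assumption for illustration: time is normalised to [0, 1], the interval bounds t_lo and t_hi and the blend weight phi are hypothetical defaults, and the rescale is one possible standard-deviation-matching form rather than any library's exact implementation.

```python
import numpy as np

def interval_scale(t, s, t_lo=0.2, t_hi=0.8):
    """Use scale s only for t in [t_lo, t_hi]; elsewhere fall back to s = 1."""
    return s if t_lo <= t <= t_hi else 1.0

def guided_with_rescale(eps_uncond, eps_cond, s, phi=0.7):
    """Guidance followed by a per-sample std-matching rescale, blended with weight phi."""
    guided = eps_uncond + s * (eps_cond - eps_uncond)
    axes = tuple(range(1, guided.ndim))  # every axis except the batch axis
    std_cond = eps_cond.std(axis=axes, keepdims=True)
    std_guided = guided.std(axis=axes, keepdims=True)
    rescaled = guided * (std_cond / std_guided)  # pull the overshoot back in
    return phi * rescaled + (1.0 - phi) * guided

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((2, 3, 8, 8))  # stand-in network outputs
eps_cond = rng.standard_normal((2, 3, 8, 8))

s_mid = interval_scale(0.5, 7.0)    # inside the interval: full guidance
s_late = interval_scale(0.95, 7.0)  # outside: plain conditional sampling
fully_rescaled = guided_with_rescale(eps_uncond, eps_cond, s_mid, phi=1.0)
```

With phi = 1 the guided prediction keeps its direction but takes on the conditional prediction's per-sample spread; smaller phi values interpolate back toward the raw extrapolation.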

Mental Model