Exaggerate the difference the prompt makes
Train a conditional diffusion model p(x|y) on image–caption pairs and sample with the caption "a corgi in a spacesuit". The result is usually plausible — and weakly conditioned: dog-ish, vaguely costume-ish, generic. Why?
Because maximum likelihood is honest. The model learns the full conditional distribution, including its enormous fuzzy margins: images that are only somewhat corgi, scenes where the caption fits loosely. Sampling draws from all of that mass. Nothing in the training objective rewards being emphatically on-prompt; the condition enters as one input among many, and the denoiser can do most of its loss reduction from the noisy image alone, leaning on y only where it must. The condition is a whisper, and we would like a dial to turn it into a command.
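Bayes suggests the dial. Since p(x|y) ∝ p(x) p(y|x), the conditional score splits into an unconditional term and a classifier term, ∇ log p(x|y) = ∇ log p(x) + ∇ log p(y|x), and we can overweight the classifier term with a scale s > 1:

∇ log p(x) + s · ∇ log p(y|x)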
This works (it powered the "diffusion beats GANs" result), but the cost is steep. The gradient must be evaluated at every noise level t, so an ordinary classifier is useless — we must train a separate classifier on noised images at all noise scales, per dataset. Worse, classifier gradients at high noise are notoriously erratic (the classifier confidently labels static), and adversarial-style gradients can satisfy the classifier without satisfying a human. An extra model, extra training, fragile gradients: painful.
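To make the classifier-guided score concrete, here is a minimal one-dimensional sketch. It assumes a standard normal prior and a logistic classifier with hypothetical weights `w` and `b`, both chosen so the gradients have closed forms. It works at a single noise level, whereas a real system needs a classifier trained across all of them. The function names are illustrative, not taken from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prior_score(x):
    # Score of a standard normal prior: d/dx log N(x; 0, 1) = -x.
    return -x

def classifier_log_prob(x, w, b):
    # log p(y=1 | x) for a toy logistic classifier (w, b are assumptions).
    return np.log(sigmoid(w * x + b))

def classifier_grad(x, w, b):
    # d/dx log sigmoid(w*x + b) = w * (1 - sigmoid(w*x + b)).
    return w * (1.0 - sigmoid(w * x + b))

def classifier_guided_score(x, s, w=2.0, b=0.0):
    # Unconditional score plus the classifier gradient, overweighted by s.
    return prior_score(x) + s * classifier_grad(x, w, b)

x = np.linspace(-2.0, 2.0, 5)
for s in (0.0, 1.0, 4.0):
    print(f"s={s}:", classifier_guided_score(x, s))
```

At s = 0 the classifier term vanishes and only the prior score remains. Larger s pushes every point harder toward the region the classifier labels confidently.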
Rearrange the same Bayes identity: ∇ log p(y|x) = ∇ log p(x|y) − ∇ log p(x). The classifier gradient is just the difference between the conditional and unconditional scores, two things one diffusion model can supply if we train it to do both jobs. The recipe is almost embarrassingly cheap: during training, randomly replace the condition with a null token ∅ about 10% of the time. One network now computes both ε̂(xt, y) and ε̂(xt, ∅). At sampling time, extrapolate:

ε̃(xt, y) = ε̂(xt, ∅) + s · (ε̂(xt, y) − ε̂(xt, ∅))
At s = 1 this is the ordinary conditional model. At s > 1 (typically 5–10) we step past the conditional prediction, along the line from unconditional to conditional — exaggerating exactly the component of the denoising direction that the prompt is responsible for.
Guidance is linear extrapolation in prediction space: keep going in the direction the prompt pulled you, beyond where the model would have stopped.
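The two halves of the recipe, dropping the condition during training and extrapolating at sampling time, can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: `NULL_TOKEN`, `drop_condition` and `cfg_combine` are hypothetical names, labels are assumed to be non-negative integer ids, and the denoiser itself is left out, so the two predictions are passed in as plain arrays.

```python
import numpy as np

NULL_TOKEN = -1  # hypothetical id reserved for the null condition

def drop_condition(labels, p_uncond=0.1, rng=None):
    """Training side: replace each label with NULL_TOKEN with probability p_uncond."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    mask = rng.random(labels.shape) < p_uncond
    return np.where(mask, NULL_TOKEN, labels)

def cfg_combine(eps_uncond, eps_cond, s):
    """Sampling side: step from the unconditional prediction toward,
    and for s > 1 past, the conditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)

# Stand-ins for the two network outputs at one sampling step.
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
print(cfg_combine(eps_uncond, eps_cond, s=7.5))
```

In the stand-in example only the first component differs between the two predictions, so only that component is amplified. The second, on which the prompt had no effect, is left untouched.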
Substituting scores for ε̂ (they differ by a factor of −σt), the guided field is the score of an unnormalised distribution:

p̃(x|y) ∝ p(x) · p(y|x)^s
We are not sampling the true conditional. We are sampling a sharpened one, reweighted toward points where the condition is unambiguously recognisable. Raising a likelihood to a power s > 1 concentrates mass on its confident region and starves the fuzzy margins, which are precisely the bland mass we wanted to suppress. (Strictly, the guided fields at different noise levels t are not the noised marginals of any single density, so a sampler that follows them does not draw exactly from the sharpened distribution; guidance is an extremely useful heuristic that the Bayes view explains rather than licenses.)
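The sharpening can be checked in closed form on a toy case. The sketch below assumes a prior p(x) = N(0, 1) and a Gaussian likelihood p(y|x) = N(y; x, τ²), both chosen purely for illustration. Under those assumptions the product of the prior and the likelihood raised to the power s is again Gaussian, and its mean and variance can be read off by completing the square. The function name and the values y = 2, τ = 1 are hypothetical.

```python
def sharpened_gaussian(y, tau, s):
    """Mean and variance of p(x) * p(y|x)**s for p(x) = N(0, 1)
    and p(y|x) = N(y; x, tau**2)."""
    precision = 1.0 + s / tau**2          # prior precision + s * likelihood precision
    mean = (s * y / tau**2) / precision   # precision-weighted pull toward y
    return mean, 1.0 / precision

for s in (1, 5, 10):
    mean, var = sharpened_gaussian(y=2.0, tau=1.0, s=s)
    print(f"s={s:>2}  mean={mean:.3f}  var={var:.3f}")
```

In this toy setting the variance falls from 0.5 at s = 1 to about 0.09 at s = 10, while the mean moves from 1.0 toward the observation at 2.0. The distribution becomes narrower and more emphatically on-condition.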
| Scale s | Samples | Failure mode |
|---|---|---|
| 1 | Faithful to p(x\|y), diverse | weak prompt adherence, generic outputs |
| 3–10 | Sharp, on-prompt, photogenic | diversity visibly narrows; samples cluster on archetypes |
| 15+ | Caricature of the prompt | oversaturated colours, burned contrast, artefacts |
The failure at high s has a mechanical reading: extrapolation can push xt off the data manifold into regions the denoiser never saw, and the per-pixel overshoot accumulates as the blown-out, oversaturated look. Practical mitigations exist — clipping or rescaling the guided prediction, decaying s across timesteps, applying guidance only in a middle interval of the trajectory — all of them ways of taking the extrapolation seriously where it helps and reining it in where the linear approximation stops being trustworthy. The same trick transplants directly to flow matching (guide the velocity field) and to autoregressive models (logit extrapolation): wherever a model can be run with and without its condition, the gap between the two runs is a steerable direction.
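Two of the mitigations mentioned above, interval-limited guidance and rescaling the guided prediction, can be sketched as small helpers. Everything here is an assumption made for illustration: the function names, the convention that t is normalised to [0, 1], the interval bounds, and the mixing weight `phi`. The rescaling shown matches the standard deviation of the guided prediction to that of the conditional one, which is one possible variant rather than a canonical definition.

```python
import numpy as np

def guidance_scale_at(t, s, t_lo=0.2, t_hi=0.8):
    """Use scale s only inside [t_lo, t_hi]; elsewhere fall back to s = 1,
    i.e. the plain conditional prediction. t is assumed normalised to [0, 1]."""
    return s if t_lo <= t <= t_hi else 1.0

def rescale_guided(eps_guided, eps_cond, phi=0.7, tiny=1e-8):
    """Shrink the guided prediction so its standard deviation matches the
    conditional prediction's, then blend with the raw guided prediction.
    phi = 0 leaves it unchanged; phi = 1 applies the full rescale."""
    ratio = np.std(eps_cond) / (np.std(eps_guided) + tiny)
    return phi * (eps_guided * ratio) + (1.0 - phi) * eps_guided

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 4))
eps_cond = rng.standard_normal((4, 4))

s = guidance_scale_at(t=0.5, s=7.5)
eps_guided = eps_uncond + s * (eps_cond - eps_uncond)
print("std before:", np.std(eps_guided))
print("std after: ", np.std(rescale_guided(eps_guided, eps_cond, phi=1.0)))
```

With `phi=1.0` the rescaled prediction has the same overall magnitude as the conditional one while keeping the guided direction. This is one way to limit the per-pixel overshoot without giving up the extrapolation entirely.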