Samples without densities, and a loss you have to train
Most generative models start by trying to write down a density pθ(x) and maximise likelihood. But for images, nobody asked for a density. We want one thing: samples that look like the data. A density is a means, not the goal — and an expensive means, because normalised densities over images are brutally hard (see the score function for exactly why).
So drop the density. Define the model implicitly: draw z ~ N(0, I), push it through a network Gθ, and call Gθ(z) a sample. We can sample all day; we just cannot evaluate pθ(x) at any point.
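As a minimal sketch of what "defined implicitly" means, assume a toy generator: a fixed random affine map followed by tanh stands in for Gθ. The names `make_generator` and `sample` are illustrative only and do not come from any library. Sampling is a single forward pass, and nothing in the code could return pθ(x).

```python
import math
import random

def make_generator(latent_dim, data_dim, seed=0):
    """Toy stand-in for G_theta: x = tanh(W z + b) with fixed random weights."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(data_dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(data_dim)]

    def G(z):
        return [math.tanh(sum(w * zi for w, zi in zip(row, z)) + bi)
                for row, bi in zip(W, b)]
    return G

def sample(G, latent_dim, rng):
    """Draw z ~ N(0, I) and push it through G. No density is evaluated anywhere."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return G(z)

rng = random.Random(1)
G = make_generator(latent_dim=2, data_dim=4)
samples = [sample(G, 2, rng) for _ in range(5)]
```

Each entry of `samples` is a 4-dimensional point. Asking how likely a given point is under the model would need the density of the pushed-forward Gaussian, which this construction never builds.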
This creates the real problem. Training needs a loss, and every familiar loss compares model density to data — which we just gave up. Per-pixel losses against real images fail badly: the average of many sharp plausible images is a blurry implausible one, and L2 rewards the average.
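The blur argument can be checked on a toy case. The sketch below assumes a made-up dataset of two equally likely 1-D "images", each with a sharp edge. It compares the expected L2 loss of the pixelwise average against that of committing to one real image.

```python
# Two equally likely "sharp" 1-D images: an edge on the left or on the right.
sharp_a = [1.0, 1.0, 0.0, 0.0]
sharp_b = [0.0, 0.0, 1.0, 1.0]
data = [sharp_a, sharp_b]

def expected_l2(pred, data):
    """Squared error of one fixed prediction, averaged over the data samples."""
    return sum(sum((p - x) ** 2 for p, x in zip(pred, img)) for img in data) / len(data)

# The pixelwise mean: a flat grey image that never occurs in the data.
blurry = [sum(px) / len(data) for px in zip(*data)]

loss_blurry = expected_l2(blurry, data)   # 1.0
loss_sharp = expected_l2(sharp_a, data)   # 2.0
```

The grey image `[0.5, 0.5, 0.5, 0.5]` scores half the loss of either real image, even though it is the only candidate that looks nothing like the data.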
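Introduce a second network Dφ(x) ∈ (0, 1), trained to output the probability that x is real. The generator is trained to make D wrong. The two play a minimax game:

$$\min_{\theta}\,\max_{\phi}\;\; \mathbb{E}_{x\sim p_{\text{data}}}\bigl[\log D_\phi(x)\bigr] \;+\; \mathbb{E}_{z\sim\mathcal{N}(0,I)}\bigl[\log\bigl(1 - D_\phi(G_\theta(z))\bigr)\bigr]$$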
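The value of the game, E[log D(x)] over real samples plus E[log(1 − D(G(z)))] over generated ones, can be estimated by Monte Carlo. The sketch below is a 1-D toy under stated assumptions: the data distribution N(2, 1), the shifted generator, and the two hand-written discriminators are all hypothetical choices made for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def value(D, G, real, zs):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in real) / len(real)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in zs) / len(zs)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(2.0, 1.0) for _ in range(2000)]   # hypothetical data: N(2, 1)
zs = [rng.gauss(0.0, 1.0) for _ in range(2000)]     # latent draws z ~ N(0, 1)

G = lambda z: z - 2.0                  # a poor generator: its samples follow N(-2, 1)
D_blind = lambda x: 0.5                # cannot tell real from fake
D_sharp = lambda x: sigmoid(4.0 * x)   # Bayes-optimal for N(2, 1) versus N(-2, 1)

v_blind = value(D_blind, G, real, zs)  # equals -log 4
v_sharp = value(D_sharp, G, real, zs)  # larger: the discriminator is winning
```

A discriminator that outputs 0.5 everywhere scores −log 4 regardless of the generator. Any value above that means the discriminator has found a real difference, and that difference is what the generator is then trained to remove.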
The analogy is the standard one because it is exact: a forger and an art expert, each improving only because the other does. The expert is the loss function, and it keeps moving — wherever the generator's samples are currently distinguishable from data, the discriminator points there.
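We can solve the inner game in closed form. For fixed G, with pθ denoting the distribution of G's samples, taking the derivative of the integrand pointwise with respect to D(x) and setting it to zero gives the optimal discriminator

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}$$

and substituting D* back into the objective leaves

$$\mathbb{E}_{x\sim p_{\text{data}}}\bigl[\log D^*(x)\bigr] + \mathbb{E}_{x\sim p_\theta}\bigl[\log\bigl(1 - D^*(x)\bigr)\bigr] = 2\,\mathrm{JSD}\bigl(p_{\text{data}} \,\|\, p_\theta\bigr) - \log 4$$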
So at discriminator optimality, the generator is minimising the Jensen–Shannon divergence to the data distribution — a real statistical objective, recovered without ever writing down a density. That is the elegance. The instability comes from the same place: we never actually have D*, only a network chasing it.
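The Jensen–Shannon claim is easy to verify numerically on a discrete toy problem. The sketch below assumes two hand-picked distributions over four outcomes (both hypothetical). It builds the pointwise-optimal discriminator pdata / (pdata + pθ), then checks that the objective at that discriminator equals 2·JSD − log 4.

```python
import math

p_data = [0.5, 0.3, 0.15, 0.05]   # hypothetical data distribution
p_model = [0.1, 0.2, 0.3, 0.4]    # hypothetical generator distribution

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value(d):
    """E_data[log D] + E_model[log(1 - D)] for a discriminator given as a table."""
    real = sum(pd * math.log(di) for pd, di in zip(p_data, d))
    fake = sum(pm * math.log(1 - di) for pm, di in zip(p_model, d))
    return real + fake

# Optimal discriminator, outcome by outcome.
d_star = [pd / (pd + pm) for pd, pm in zip(p_data, p_model)]

v_star = value(d_star)
identity_gap = abs(v_star - (2 * jsd(p_data, p_model) - math.log(4)))

# Any other discriminator does worse, e.g. one that is uniformly too confident.
v_perturbed = value([di + 0.05 for di in d_star])
```

`identity_gap` comes out at floating-point noise, and `v_perturbed` is strictly below `v_star`. This is the sense in which a network that only approximates D* hands the generator a different objective from the one the theory analyses.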
Saturating gradients
Early in training, fakes are obvious, so D(G(z)) ≈ 0. The generator's loss term log(1 − D(G(z))) is then flat — its gradient vanishes precisely when the generator most needs a signal. The forger gets caught instantly and learns nothing from being caught.
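The flatness can be made explicit under one assumption not stated above: that the discriminator ends in a sigmoid, so D(G(z)) = σ(a) for some logit a. Differentiating the generator's term with respect to that logit:

$$\frac{\partial}{\partial a}\log\bigl(1-\sigma(a)\bigr) = -\frac{\sigma(a)\bigl(1-\sigma(a)\bigr)}{1-\sigma(a)} = -\sigma(a) = -D(G(z))$$

Every gradient that reaches the generator's parameters is multiplied by this factor, which goes to zero exactly as D(G(z)) → 0.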
The standard fix is the non-saturating loss: instead of minimising log(1 − D(G(z))), the generator maximises log D(G(z)). Same fixed points, but now the gradient is strong when fakes are bad and weak when they are good, which is the orientation you want.
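The two generator losses can be compared numerically. The sketch below assumes the discriminator ends in a sigmoid over a logit `a` (an assumption, although a common one) and uses central finite differences to measure how steep each loss is at an obvious fake, an undecided sample, and a convincing fake.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss_saturating(a):
    """Original generator loss: minimise log(1 - D), with D = sigmoid(a)."""
    return math.log(1.0 - sigmoid(a))

def loss_non_saturating(a):
    """Non-saturating variant: minimise -log D, i.e. maximise log D."""
    return -math.log(sigmoid(a))

def slope(f, a, h=1e-5):
    """Central finite-difference estimate of df/da."""
    return (f(a + h) - f(a - h)) / (2 * h)

for a in (-8.0, 0.0, 8.0):   # obvious fake, undecided, convincing fake
    print(f"D={sigmoid(a):.4f}  "
          f"saturating={slope(loss_saturating, a):+.4f}  "
          f"non-saturating={slope(loss_non_saturating, a):+.4f}")
```

At the obvious fake (D ≈ 0) the saturating loss is almost flat, while the non-saturating loss has a slope of magnitude close to 1. At the convincing fake the roles swap.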
Mode collapse
Nothing in the objective forces the generator to cover all of pdata. Producing one perfectly convincing kind of sample (one face, one digit) can fool the current discriminator just fine. The discriminator then learns to reject that mode, the generator hops to another, and the pair can orbit indefinitely instead of converging. The game has a good equilibrium, but simultaneous gradient updates are not obliged to find it.
Figure: The generator buys a perfect score on one mode (solid terracotta) while ignoring the rest of the data (dashed blue). The minimax loss does not object until D learns to.
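The failure to converge does not need a neural network to show up. The sketch below is not a GAN. It is the simplest two-player toy, min over x and max over y of f(x, y) = x·y, chosen here purely as an illustration. Its equilibrium is (0, 0), yet simultaneous gradient steps circle that point and drift outwards.

```python
import math

def simultaneous_gd(x, y, lr, steps):
    """Simultaneous gradient descent on x and ascent on y for f(x, y) = x * y."""
    radii = [math.hypot(x, y)]              # distance from the equilibrium (0, 0)
    for _ in range(steps):
        gx, gy = y, x                       # df/dx = y, df/dy = x
        x, y = x - lr * gx, y + lr * gy     # both players move at once
        radii.append(math.hypot(x, y))
    return x, y, radii

x, y, radii = simultaneous_gd(x=1.0, y=0.0, lr=0.1, steps=100)
```

Each step multiplies the distance from the equilibrium by sqrt(1 + lr²), so `radii` never decreases: the players rotate around the solution without approaching it. Real GAN dynamics are far messier, but this is the basic mechanism behind the orbiting described above.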
As the default image generator, GANs lost to diffusion: likelihood-free training bought sharpness but cost mode coverage and stability, and diffusion delivered the coverage with a boring, stable regression loss. The single-step sampling speed of GANs remains genuinely enviable, which is why distillation of diffusion models back into few-step generators keeps borrowing adversarial machinery.
The deeper idea — a trained network as a perceptual loss — never went away. Adversarial terms still sharpen the decoders of latent-diffusion autoencoders, super-resolution models, and codecs, anywhere "looks real to a critic" beats "close in L2".
| Property | GAN | Diffusion |
|---|---|---|
| Sampling cost | 1 forward pass | many steps |
| Training stability | adversarial game | plain regression |
| Mode coverage | collapse-prone | likelihood-anchored |
| Density / likelihood | none | bound available |