Generative Modelling

GANs

Samples without densities, and a loss you have to train

01 · First principles
What if we never need p(x)?

Most generative models start by trying to write down a density pθ(x) and maximise likelihood. But for images, nobody asked for a density. We want one thing: samples that look like the data. A density is a means, not the goal — and an expensive means, because normalised densities over images are brutally hard (see the score function for exactly why).

So drop the density. Define the model implicitly: draw z ~ N(0, I), push it through a network Gθ, and call Gθ(z) a sample. We can sample all day; we just cannot evaluate pθ(x) at any point.
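A minimal sketch of what "implicit" means in practice, assuming a toy hand-written map in place of a trained network. The names `generator` and `sample` and the fixed coefficients are hypothetical. The point is only that the model exposes sampling and nothing that resembles a density.

```python
import random

def generator(z):
    """Toy stand-in for G_theta: a fixed nonlinear map from 2-D noise to a 2-D sample."""
    z1, z2 = z
    # Hypothetical fixed coefficients; a real G_theta would be a trained network.
    return (z1 + 0.5 * z2 ** 2, z2 - 0.3 * z1 ** 3)

def sample(n, seed=0):
    """Draw n samples: z ~ N(0, I), then push each z through the generator."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        out.append(generator(z))
    return out

samples = sample(5)
# There is deliberately no density(x) function here: evaluating p_theta(x)
# would require inverting the map and tracking volume change.
```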

This creates the real problem. Training needs a loss, and every familiar loss compares model density to data — which we just gave up. Per-pixel losses against real images fail badly: the average of many sharp plausible images is a blurry implausible one, and L2 rewards the average.
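The blur claim can be checked with arithmetic. The sketch below shrinks each "image" to a single pixel that is either black (0) or white (1), which is an assumption made purely for illustration, and compares the L2 loss of three candidate outputs.

```python
def l2_loss(pred, targets):
    """Mean squared error of one prediction against many equally plausible targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Two sharp, equally plausible one-pixel "images": black (0.0) or white (1.0).
targets = [0.0, 1.0] * 50

mean_pred = sum(targets) / len(targets)   # the grey average of the two
loss_mean = l2_loss(mean_pred, targets)   # loss of the blurry grey output
loss_black = l2_loss(0.0, targets)        # loss of a sharp black output
loss_white = l2_loss(1.0, targets)        # loss of a sharp white output
```

Grey never occurs in the data, yet it scores 0.25 while either sharp answer scores 0.5. L2 prefers the implausible average.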

The forcing question: how do you train a sampler when you cannot evaluate its density and pixel distance rewards blur? GANs answer: learn the loss too.

02 · The mechanism
The discriminator is a learned loss

Introduce a second network Dφ(x) ∈ (0, 1), trained to output the probability that x is real. The generator is trained to make D wrong. The two play a minimax game:

min_G max_D   E_{x~data}[log D(x)] + E_z[log(1 − D(G(z)))]

The first term rewards D for being right on real samples; the second rewards D for being right on fakes.

The analogy is the standard one because it is exact: a forger and an art expert, each improving only because the other does. The expert is the loss function, and it keeps moving — wherever the generator's samples are currently distinguishable from data, the discriminator points there.
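To make the objective concrete, here is a sketch that estimates the minimax value from a batch of discriminator outputs. The function name `value` and the example numbers are hypothetical; a real training loop would obtain these outputs from the two networks.

```python
import math

def value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes the value towards its ceiling of 0.
v_good_d = value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
v_fooled = value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

D climbs this number and G pushes it down. The fully fooled case evaluates to −log 4.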

We can solve the inner game in closed form. For fixed G, taking the derivative pointwise gives the optimal discriminator

D*(x) = p_data(x) / (p_data(x) + p_g(x))

and substituting D* back into the objective leaves

min_G   2·JS(p_data ‖ p_g) − log 4
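For readers who want the two steps spelled out, the following is the standard argument. It assumes the generator's samples have a density p_g, so the expectation over z can be rewritten as an integral over x, and it uses the mixture m = (p_data + p_g)/2 as shorthand.

```latex
\begin{aligned}
V(D, G) &= \int \Big[\, p_{\text{data}}(x)\,\log D(x) + p_g(x)\,\log\big(1 - D(x)\big) \Big]\,dx \\[4pt]
0 &= \frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)}
\;\;\Longrightarrow\;\;
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \\[4pt]
V(D^*, G) &= \mathbb{E}_{p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}}{2m}\right]
           + \mathbb{E}_{p_g}\!\left[\log \frac{p_g}{2m}\right],
\qquad m = \tfrac{1}{2}\big(p_{\text{data}} + p_g\big) \\[4pt]
&= \mathrm{KL}(p_{\text{data}} \,\|\, m) + \mathrm{KL}(p_g \,\|\, m) - 2\log 2 \\[4pt]
&= 2\,\mathrm{JS}(p_{\text{data}} \,\|\, p_g) - \log 4
\end{aligned}
```

The integrand a·log y + b·log(1 − y) is concave in y, so the stationary point in the second line is a maximum.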

So at discriminator optimality, the generator is minimising the Jensen–Shannon divergence to the data distribution — a real statistical objective, recovered without ever writing down a density. That is the elegance. The instability comes from the same place: we never actually have D*, only a network chasing it.
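The identity is easy to verify numerically on discrete distributions. The sketch below uses two hypothetical three-outcome distributions and checks that plugging D* into the objective gives 2·JS − log 4.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence with natural logarithms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """E_data[log D*(x)] + E_g[log(1 - D*(x))] with D* = p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue  # outcome has no mass under either distribution
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total

p_data = [0.5, 0.3, 0.2]   # hypothetical toy data distribution
p_g = [0.1, 0.1, 0.8]      # hypothetical toy generator distribution

lhs = value_at_optimal_d(p_data, p_g)
rhs = 2 * js(p_data, p_g) - math.log(4)
```

The two sides agree to floating-point precision, and when p_g equals p_data the value drops to its floor of −log 4.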

03 · How it breaks
Two characteristic failures

Saturating gradients

Early in training, fakes are obvious, so D(G(z)) ≈ 0. The generator's loss term log(1 − D(G(z))) is then flat — its gradient vanishes precisely when the generator most needs a signal. The forger gets caught instantly and learns nothing from being caught.

The standard fix is the non-saturating loss: instead of minimising log(1 − D(G(z))), the generator maximises log D(G(z)). Same fixed points, but now the gradient is strong when fakes are bad and weak when they are good, which is the orientation you want.
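A quick way to see the difference is to compare the two gradients directly. The sketch assumes the discriminator ends in a sigmoid, D = sigmoid(a), and differentiates each generator loss with respect to the logit a, since every generator gradient has to pass through it. The function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the loss the generator minimises in the minimax form."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the loss the generator minimises in the non-saturating form."""
    return -(1.0 - sigmoid(a))

# Negative logit: D is sure the sample is fake. Positive logit: D is fooled.
for a in (-6.0, 0.0, 6.0):
    print(f"logit={a:+.0f}  D={sigmoid(a):.4f}  "
          f"saturating={grad_saturating(a):+.4f}  "
          f"non-saturating={grad_non_saturating(a):+.4f}")
```

Both gradients have the same sign, so both push the logit up. At a logit of −6, where the fake is obvious, the saturating gradient has magnitude below 0.01 while the non-saturating one is above 0.99.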

Mode collapse

Nothing in the objective forces the generator to cover all of pdata. Producing one perfectly convincing kind of sample — one face, one digit — can fool the current discriminator just fine. The discriminator then learns to reject that mode, the generator hops to another, and the pair orbits forever instead of converging. The game has a good equilibrium, but simultaneous gradient descent is not obliged to find it.
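The first claim can be shown with a frozen discriminator. The sketch assumes a toy world of three modes and a hypothetical D that rates each mode as 90% likely to be real, then scores a covering generator and a collapsed one on the generator's term of the objective.

```python
import math

# Hypothetical frozen discriminator: D(x) for samples landing on each data mode.
d_at_mode = {"A": 0.9, "B": 0.9, "C": 0.9}

def generator_loss(p_g):
    """Generator term E_g[log(1 - D(x))] for a generator distribution over modes."""
    return sum(prob * math.log(1.0 - d_at_mode[mode]) for mode, prob in p_g.items())

covering = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}   # mass spread over every mode
collapsed = {"A": 1.0, "B": 0.0, "C": 0.0}        # all mass on a single mode

loss_covering = generator_loss(covering)
loss_collapsed = generator_loss(collapsed)
```

The two losses are equal: the generator's term only looks at D where the generator puts mass, so against this D, dropping two modes costs nothing. Only a later update to D can penalise the collapse.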

Why JS makes it worse
When p_data and p_g have (near-)disjoint support, which is typical for images on low-dimensional manifolds, the JS divergence saturates at its maximum of log 2, no matter how far apart the two distributions are. A near-optimal D then gives the generator essentially zero useful gradient.
WGAN in two lines
Replace JS with the Wasserstein-1 distance, estimated by a Lipschitz-constrained critic: max_{‖f‖_L ≤ 1} E_data[f(x)] − E_g[f(x)]. Earth-mover distance stays smooth even across disjoint supports, so gradients survive.
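The contrast between the two divergences shows up in the simplest possible case: two point masses on a 1-D grid. The sketch computes W1 with the 1-D closed form (the area between the two CDFs), not with a learned critic, and the helper names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence with natural logarithms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1(p, q, dx=1.0):
    """W1 between two histograms on the same evenly spaced 1-D grid: sum of |CDF_p - CDF_q| * dx."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q) * dx
    return total

def point_mass(index, size=10):
    """All probability on one grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]

p_data = point_mass(0)
results = []
for shift in (1, 3, 6):
    p_g = point_mass(shift)
    results.append((shift, js(p_data, p_g), wasserstein1(p_data, p_g)))
```

For every non-zero shift JS reads log 2, so it cannot tell a near miss from a distant one. W1 equals the shift itself, which gives the generator a direction to move in.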

04 · Visualize it
Mode collapse

[Figure: two panels. Left, "Data: three modes" (p_data). Right, "Generator: one mode" (p_g).]

The generator buys a perfect score on one mode (solid terracotta) while ignoring the rest of the data (dashed blue). The minimax loss does not object until D learns to.

05 · Legacy
What survived

As the default image generator, GANs lost to diffusion: likelihood-free training bought sharpness but cost mode coverage and stability, and diffusion delivered the coverage with a boring, stable regression loss. The single-step sampling speed of GANs remains genuinely enviable, which is why distillation of diffusion models back into few-step generators keeps borrowing adversarial machinery.

The deeper idea — a trained network as a perceptual loss — never went away. Adversarial terms still sharpen the decoders of latent-diffusion autoencoders, super-resolution models, and codecs, anywhere "looks real to a critic" beats "close in L2".

Property | GAN | Diffusion
Sampling cost | 1 forward pass | many steps
Training stability | adversarial game | plain regression
Mode coverage | collapse-prone | likelihood-anchored
Density / likelihood | none | bound available
Mental Model