Generative Modelling

GANs

Samples without densities, and a loss you have to train

01 · First principles

What if we never need p(x)?

Most generative models start by trying to write down a density p_θ(x) and maximise likelihood. But for images, nobody asked for a density. We want one thing: samples that look like the data. A density is a means, not the goal — and an expensive means, because normalised densities over images are brutally hard (see the score function for exactly why).

So drop the density. Define the model implicitly: draw z ~ N(0, I), push it through a network G_θ, and call G_θ(z) a sample. We can sample all day; we just cannot evaluate p_θ(x) at any point.
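A minimal sketch of what "implicit" means in practice, assuming a toy one-dimensional affine map in place of a real network. The names `generator`, `sample`, and `theta` are illustrative, not from any library. For this toy map the density happens to be tractable; the point is only that sampling never calls it.

```python
import random

def generator(z, theta):
    """Toy stand-in for G_theta: an affine map of the latent z.
    A real generator would be a deep network with no usable density."""
    scale, shift = theta
    return scale * z + shift

def sample(theta, n, seed=0):
    """Draw n samples by pushing z ~ N(0, 1) through the generator."""
    rng = random.Random(seed)
    return [generator(rng.gauss(0.0, 1.0), theta) for _ in range(n)]

# Sampling is cheap and needs no p_theta(x) anywhere.
samples = sample(theta=(0.5, 2.0), n=10_000)
mean = sum(samples) / len(samples)
```

Nothing in `sample` evaluates a probability, which is exactly why a likelihood-based loss has nothing to grab onto.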

This creates the real problem. Training needs a loss, and every familiar loss compares model density to data — which we just gave up. Per-pixel losses against real images fail badly: the average of many sharp plausible images is a blurry implausible one, and L2 rewards the average.
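The blur argument can be checked on a four-pixel toy, assuming two equally likely sharp targets for the same input. The "images" below are made up for illustration.

```python
# Two equally plausible sharp targets: a bright pixel at the left or at the right.
targets = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

def expected_l2(pred, targets):
    """Average squared error of one prediction against all plausible targets."""
    total = 0.0
    for tgt in targets:
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(targets)

# The pixelwise mean is a blurry image that matches neither target.
blurry = [sum(col) / len(targets) for col in zip(*targets)]

loss_blurry = expected_l2(blurry, targets)      # predicting the average
loss_sharp = expected_l2(targets[0], targets)   # committing to one sharp answer
```

The blurry prediction scores 0.5 and the sharp one scores 1.0, so an L2-trained model is pushed toward the average even though the average is the one image that never occurs in the data.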

The forcing question: how do you train a sampler when you cannot evaluate its density and pixel distance rewards blur? GANs answer: learn the loss too.

02 · The mechanism

The discriminator is a learned loss

Introduce a second network D_φ(x) ∈ (0, 1), trained to output the probability that x is real. The generator is trained to make D wrong. The two play a minimax game:

min_G max_D E_x~data[log D(x)] + E_z[log(1 − D(G(z)))]

The first term is large when D is right on real data; the second is large when D is right on fakes.

The analogy is the standard one because it is exact: a forger and an art expert, each improving only because the other does. The expert is the loss function, and it keeps moving — wherever the generator's samples are currently distinguishable from data, the discriminator points there.
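To make the objective concrete, here is a small sketch that estimates the minimax value from discriminator outputs. The function name `value` and the example scores are assumptions for illustration; in a real setup the scores would come from D evaluated on batches of real and generated samples.

```python
import math

def value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].
    d_real: D's outputs on real samples; d_fake: D's outputs on generated ones."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
v_blind = value([0.5] * 4, [0.5] * 4)

# A discriminator that is almost always right drives the value toward 0.
v_sharp = value([0.99] * 4, [0.01] * 4)
```

The blind discriminator gives exactly −log 4, the floor the generator is pushing toward, while a confident and correct one approaches the ceiling of 0 that the discriminator is pushing toward.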

We can solve the inner game in closed form. For fixed G, write the objective as an integral over x and maximise the integrand pointwise in D(x); setting its derivative to zero gives the optimal discriminator

D*(x) = p_data(x) / (p_data(x) + p_g(x))

and substituting D* back into the objective leaves

min_G 2·JS(p_data ‖ p_g) − log 4
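The two steps above can be spelled out as follows. This is the standard derivation, written with p_g for the distribution of G(z) and m for the equal mixture of p_data and p_g; the symbols y, a, b, and m are introduced here only for the derivation.

```latex
\begin{align*}
V(D, G) &= \int \Big[\, p_{\mathrm{data}}(x)\,\log D(x) + p_g(x)\,\log\big(1 - D(x)\big) \Big]\, dx .
\intertext{Fix $x$ and write $a = p_{\mathrm{data}}(x)$, $b = p_g(x)$, $y = D(x)$. The integrand is concave in $y$, and}
\frac{d}{dy}\Big[ a \log y + b \log(1 - y) \Big] &= \frac{a}{y} - \frac{b}{1 - y} = 0
\quad\Longrightarrow\quad y = \frac{a}{a + b} .
\intertext{Substituting $D^*$ and setting $m = \tfrac{1}{2}(p_{\mathrm{data}} + p_g)$, so that $p_{\mathrm{data}} + p_g = 2m$:}
V(D^*, G) &= \mathbb{E}_{p_{\mathrm{data}}}\!\left[ \log \frac{p_{\mathrm{data}}}{2m} \right]
           + \mathbb{E}_{p_g}\!\left[ \log \frac{p_g}{2m} \right] \\
          &= \mathrm{KL}(p_{\mathrm{data}} \,\|\, m) + \mathrm{KL}(p_g \,\|\, m) - 2 \log 2 \\
          &= 2\,\mathrm{JS}(p_{\mathrm{data}} \,\|\, p_g) - \log 4 .
\end{align*}
```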

So at discriminator optimality, the generator is minimising the Jensen–Shannon divergence to the data distribution — a real statistical objective, recovered without ever writing down a density. That is the elegance. The instability comes from the same place: we never actually have D*, only a network chasing it.
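The closed form is easy to sanity-check numerically on a discrete toy problem. The three-bin distributions below are made up for illustration, and the helper names `kl`, `js`, and `value` are this sketch's own.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence via the equal mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value(p_data, p_g, d):
    """Exact V(D, G) when x ranges over a finite set of bins."""
    return sum(pd * math.log(di) + pg * math.log(1 - di)
               for pd, pg, di in zip(p_data, p_g, d))

p_data = [0.5, 0.3, 0.2]   # hypothetical data distribution
p_g = [0.2, 0.2, 0.6]      # hypothetical generator distribution

d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]
v_star = value(p_data, p_g, d_star)
identity_rhs = 2 * js(p_data, p_g) - math.log(4)

# Any other discriminator does worse, e.g. one that is slightly overconfident.
v_perturbed = value(p_data, p_g, [di + 0.05 for di in d_star])
```

Here `v_star` matches `identity_rhs` to floating-point precision and `v_perturbed` is strictly smaller, which is the gap between the theory (D* in hand) and practice (a network only approximating it).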

03 · How it breaks

Two characteristic failures

Saturating gradients

Early in training, fakes are obvious, so D(G(z)) ≈ 0. Written in terms of the discriminator's logit a, with D = σ(a), the generator's loss term log(1 − D(G(z))) has slope −σ(a) ≈ 0: it is flat, and its gradient vanishes precisely when the generator most needs a signal. The forger gets caught instantly and learns nothing from being caught.

The standard fix is the non-saturating loss: instead of minimising log(1 − D(G(z))), the generator maximises log D(G(z)). Same fixed points, but now the gradient is strong when fakes are bad and weak when they are good, which is the orientation you want.
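A short sketch comparing the two generator losses, assuming the discriminator ends in a sigmoid so that D = σ(a) for a logit a. The gradients are taken with respect to that logit, and the specific logit values are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original loss the generator minimises."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating loss the generator minimises."""
    return -(1.0 - sigmoid(a))

obvious_fake = -6.0     # D is confident the sample is fake
convincing_fake = 6.0   # D is confident the sample is real

sat_early = grad_saturating(obvious_fake)
ns_early = grad_non_saturating(obvious_fake)
sat_late = grad_saturating(convincing_fake)
ns_late = grad_non_saturating(convincing_fake)
```

Both gradients are negative, so gradient descent raises the logit either way; only the magnitudes differ. On an obvious fake the saturating loss gives a magnitude near 0 and the non-saturating one gives a magnitude near 1, and the roles swap once the fake is convincing.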

Mode collapse

Nothing in the objective forces the generator to cover all of p_data. Producing one perfectly convincing kind of sample (one face, one digit) can fool the current discriminator just fine. The discriminator then learns to reject that mode, the generator hops to another, and the pair can orbit indefinitely instead of converging. The game has a good equilibrium, but simultaneous gradient descent is not obliged to find it.
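One way to see why collapse is tempting: against a fixed discriminator, the generator's payoff E[log D(G(z))] is linear in how it splits its samples across modes, so the best response is to put everything on the highest-scoring mode. The scores below are invented for illustration and the names are this sketch's own.

```python
import math

# Hypothetical scores a frozen discriminator assigns to samples from three modes.
d_scores = {"mode_a": 0.6, "mode_b": 0.5, "mode_c": 0.4}

def generator_payoff(mixture):
    """E[log D(G(z))] when the generator emits each mode with the given weight."""
    return sum(w * math.log(d_scores[m]) for m, w in mixture.items() if w > 0)

covering = {"mode_a": 1 / 3, "mode_b": 1 / 3, "mode_c": 1 / 3}
collapsed = {"mode_a": 1.0, "mode_b": 0.0, "mode_c": 0.0}

payoff_covering = generator_payoff(covering)
payoff_collapsed = generator_payoff(collapsed)
```

The collapsed generator scores higher than the covering one. Only the discriminator's next update, which lowers the score of the overused mode, pushes back, and that is the start of the orbit described above.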

Why JS makes it worse

When p_data and p_g have (near-)disjoint support, which is typical for images concentrated on low-dimensional manifolds, JS divergence saturates at its maximum value, log 2, no matter how far apart the two distributions are. A near-optimal D then gives the generator essentially zero useful gradient.
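The saturation shows up even in the simplest case, two point masses on a discrete grid. This is a toy sketch; `point_mass` and the grid size are illustrative choices.

```python
import math

def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(index, size=10):
    """All probability on a single bin."""
    return [1.0 if i == index else 0.0 for i in range(size)]

data = point_mass(0)

# Slide the generator's point mass 0, 1, 2, 3, 4 bins away from the data.
js_by_offset = [js(data, point_mass(k)) for k in range(5)]
```

The divergence is 0 at offset 0 and log 2 at every other offset. Moving the generator from four bins away to one bin away changes nothing, so JS gives no signal about which direction is closer.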

WGAN in two lines

Replace JS with the Wasserstein-1 distance, estimated by a Lipschitz-constrained critic: max_{‖f‖_L≤1} E_data[f(x)] − E_g[f(x)]. Earth-mover distance stays smooth even across disjoint supports, so gradients survive.
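For contrast, here is the same sliding-point-mass experiment with Wasserstein-1. This sketch computes the distance directly in its transport form, using the fact that in one dimension the optimal matching pairs sorted samples. A WGAN instead estimates the same quantity through the critic form written above. The function name is this sketch's own.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two equal-size 1D empirical samples.
    In one dimension the optimal transport plan matches sorted points."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)

data = [0.0] * 100

# Generator samples all sit at theta; the supports are disjoint unless theta is 0.
offsets = (0.0, 0.5, 1.0, 2.0)
w1_by_offset = [wasserstein1_1d(data, [theta] * 100) for theta in offsets]
```

The distance equals the offset itself, so it shrinks steadily as the generator approaches the data, which is the smoothness that keeps the gradient alive.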

04 · Visualize it

Mode collapse

The generator buys a perfect score on one mode (solid terracotta) while ignoring the rest of the data (dashed blue). The minimax loss does not object until D learns to.

05 · Legacy

What survived

As the default image generator, GANs lost to diffusion: likelihood-free training bought sharpness but cost mode coverage and stability, and diffusion delivered the coverage with a boring, stable regression loss. The single-step sampling speed of GANs remains genuinely enviable, which is why distillation of diffusion models back into few-step generators keeps borrowing adversarial machinery.

The deeper idea — a trained network as a perceptual loss — never went away. Adversarial terms still sharpen the decoders of latent-diffusion autoencoders, super-resolution models, and codecs, anywhere "looks real to a critic" beats "close in L2".

Property	GAN	Diffusion
Sampling cost	1 forward pass	many steps
Training stability	adversarial game	plain regression
Mode coverage	collapse-prone	likelihood-anchored
Density / likelihood	none	bound available

Mental Model

A GAN is a sampler plus a learned, moving loss function; the discriminator points at whatever currently gives the fakes away.
At discriminator optimality the game is JS-divergence minimisation — but you never have the optimal discriminator, and that gap is where the instability lives.
Saturation: the generator's gradient dies exactly when it is losing; fix the orientation with the non-saturating loss.
Mode collapse: fooling the critic does not require covering the data; one good forgery suffices.
The lasting legacy is not the architecture but the idea that realism is best measured by a trained network, not a pixel norm.