KL, symmetrised through a mixture — and why GAN gradients die
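KL divergence is built for an asymmetric job: a true distribution p judging a model q. Ask it instead to compare two arbitrary distributions on equal footing (two generators, two corpora, real versus fake) and two flaws surface immediately. First, it is asymmetric: KL(p ∥ q) ≠ KL(q ∥ p) in general, so the answer depends on which side you call the reference. Second, it is unbounded: if p puts mass anywhere q has none, KL(p ∥ q) = ∞, and when the supports are fully disjoint both directions are infinite.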
For early generative models this is not a corner case; it is the default. A fresh generator and the real data typically occupy disjoint slivers of a high-dimensional space.
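To make the asymmetry and the infinity concrete, here is a minimal sketch on small discrete distributions. The helper `kl` and the example vectors are illustrative choices, not part of the original text; values are in nats.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions on a shared grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0                      # terms with p(x) = 0 contribute 0
    if np.any(q[support] == 0):          # p has mass where q has none
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Overlapping supports: both directions are finite but unequal.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
kl_pq, kl_qp = kl(p, q), kl(q, p)

# Disjoint supports: both directions are infinite.
a = np.array([0.5, 0.5, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.5, 0.5])
kl_ab, kl_ba = kl(a, b), kl(b, a)

print(kl_pq, kl_qp)   # two different finite numbers
print(kl_ab, kl_ba)   # inf inf
```

Swapping the arguments changes the answer in the first case, and in the second case KL offers no finite answer at all.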
Neither p nor q deserves to be the reference, so make a neutral one, the average distribution m = ½(p + q), and let each side pay KL against it:

JSD(p, q) = ½ KL(p ∥ m) + ½ KL(q ∥ m)
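The mixture fixes both flaws at once. Symmetry is immediate from the formula. And the infinity is gone: wherever p(x) > 0 we have m(x) ≥ p(x)/2 > 0, so the ratio p(x)/m(x) is at most 2 and the log-ratio inside KL(p ∥ m) is at most log 2. The same holds for q, so the reference can never be empty where either side has mass. Each KL term is therefore at most log 2, and so is their average. Hence:

0 ≤ JSD(p, q) ≤ log 2

with 0 attained exactly when p = q and log 2 attained exactly when the supports are disjoint.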
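A minimal NumPy sketch of the mixture construction, assuming discrete distributions given as probability vectors on a shared grid. The function names `kl` and `jsd` are illustrative; values are in nats.

```python
import numpy as np

def kl(p, m):
    """KL(p || m) in nats; assumes m > 0 wherever p > 0."""
    mask = p > 0                         # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / m[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of each side against the mixture."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                    # m(x) >= p(x)/2, so kl(p, m) is finite
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

print(jsd(p, q), jsd(q, p))              # same value in both orders
print(jsd(p, q) <= np.log(2))            # stays under the log 2 ceiling
```

No special case for zeros in q is needed: the mixture is positive wherever either input is, which is exactly the argument in the paragraph above.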
Top: overlapping supports — JSD varies smoothly as q moves. Bottom: disjoint supports — JSD is pinned at log 2 regardless of the gap.
The bottom panel shows the catch baked into the bound. Once supports are disjoint, JSD reports log 2 whatever the distance between them. Bounded means saturating, and saturating means the derivative with respect to "move q toward p" is zero. Remember this panel; it reappears as a training pathology in the next section.
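The two regimes can be reproduced numerically with a toy setup: p is a block of four equal bins, and q is the same block slid to the right by `offset` bins. The grid size, block width and offsets are arbitrary illustrative choices, not taken from the figure.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence in nats for discrete distributions."""
    m = 0.5 * (p + q)
    def kl(a):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / m[mask])))
    return 0.5 * kl(p) + 0.5 * kl(q)

n_bins, width = 40, 4

def block(offset):
    """Uniform mass on `width` consecutive bins starting at `offset`."""
    d = np.zeros(n_bins)
    d[offset:offset + width] = 1.0 / width
    return d

p = block(0)

# Offsets 0..3 still share bins with p: JSD grows as q moves away.
overlapping = [jsd(p, block(k)) for k in range(0, 4)]

# Offsets 4, 12, 20, 28 share no bins with p: JSD no longer changes.
disjoint = [jsd(p, block(k)) for k in range(4, 36, 8)]

print(overlapping)
print(disjoint, np.log(2))
```

In the disjoint list, moving q from 4 bins away to 28 bins away changes nothing, so a finite-difference "gradient" with respect to the offset is exactly zero there.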
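The original GAN objective, with an optimal discriminator D* plugged in, reduces to a shifted and scaled JSD:

V(D*, G) = −log 4 + 2 · JSD(p_data, p_G)

where V is the GAN value function, p_data is the data distribution and p_G is the distribution of generator samples.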
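For readers who want the intermediate steps, the reduction can be sketched as follows. It follows the standard argument from the original GAN paper (Goodfellow et al., 2014). The symbols V (the value function) and m (the mixture of p_data and p_G) are notation assumed here.

```latex
\begin{align*}
V(D, G) &= \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
         + \mathbb{E}_{x \sim p_G}\big[\log(1 - D(x))\big] \\[4pt]
D^*(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}
  \qquad \text{(pointwise maximiser of } V \text{ for fixed } G\text{)} \\[4pt]
V(D^*, G) &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{2\,m(x)}\right]
           + \mathbb{E}_{x \sim p_G}\!\left[\log \frac{p_G(x)}{2\,m(x)}\right],
  \qquad m = \tfrac{1}{2}\,(p_{\text{data}} + p_G) \\[4pt]
&= -\log 4 + \mathrm{KL}(p_{\text{data}} \,\|\, m) + \mathrm{KL}(p_G \,\|\, m) \\[4pt]
&= -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}}, p_G)
\end{align*}
```

Each expectation contributes one −log 2 from the factor of 2 in the denominator, which is where the −log 4 comes from.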
So a GAN generator is, in the idealised limit, minimising JSD(p_data, p_G) by gradient descent. Elegant, and the source of the field's most famous failure. Early in training, p_G and p_data have essentially disjoint supports (both are thin manifolds in pixel space), which is precisely the flat regime above: JSD sits at log 2, a confident discriminator saturates, and the generator's gradient vanishes. The theory's tidiest property and the practice's worst instability are the same fact.
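The same saturation can be seen per sample. A rough sketch, assuming the discriminator outputs D = sigmoid(a) for a logit a on a generated sample and the generator minimises the original minimax term log(1 − D). The logit values below are hypothetical, chosen only to show the trend. By the chain rule, the gradient reaching the generator's parameters is scaled by d/da log(1 − sigmoid(a)) = −sigmoid(a).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generator_loss(a):
    """log(1 - sigmoid(a)), written as -log(1 + exp(a)) for stability."""
    return -np.log1p(np.exp(a))

def generator_loss_grad(a):
    """Analytic derivative of generator_loss with respect to the logit a."""
    return -sigmoid(a)

# Hypothetical discriminator logits on fake samples, from unsure (0)
# to very confident that the sample is fake (-20).
logits = np.array([0.0, -2.0, -5.0, -10.0, -20.0])
grads = generator_loss_grad(logits)

for a, g in zip(logits, grads):
    print(f"logit {a:6.1f}   D(fake) {sigmoid(a):.2e}   dloss/dlogit {g:.2e}")
```

The more confidently the discriminator rejects the fakes, the smaller the factor that multiplies every gradient flowing back into the generator.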
| Property | KL(p ∥ q) | JSD(p, q) |
|---|---|---|
| Symmetric | No | Yes |
| Bounded | No (∞ on disjoint support) | Yes — [0, log 2] |
| Metric (after √) | No | Yes |
| Gradient between disjoint supports | Undefined / ∞ | Zero (saturated) |
| Natural role | Truth judging a model (MLE, VAEs, RLHF) | Two peers compared fairly (GAN theory, corpus drift, embedding-distribution shift) |
Note the shared weakness in the fourth row: JSD fixes KL's infinity but not the underlying blindness between non-overlapping distributions. When that regime is the one you care about, neither divergence is the right tool.