The price of believing q when the truth is p
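From the entropy note: a code built for q assigns each event x a codeword of length −log q(x). If the data actually comes from p, the average overspend (extra bits per symbol, beyond the irreducible H(p)) is the KL divergence: KL(p ∥ q) = Σₓ p(x) log (p(x)/q(x)), that is, the expectation under p of the log-ratio log (p/q).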
Read the structure carefully: the log-ratio measures pointwise wrongness, and the expectation is taken under p — the truth decides which mistakes matter. Wherever p puts no mass, q can be arbitrarily wrong for free; wherever p puts mass and q puts nearly none, log p/q explodes. (That explosion is not a bug; it is the whole personality of the measure, as we will see.)
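A minimal sketch of this accounting for small discrete distributions, using only the standard library. The function names and the example distributions are illustrative assumptions; the convention 0 · log 0 = 0 is what lets q be anything at all where p has no mass.

```python
import math

def entropy_bits(p):
    """H(p) in bits, using the convention 0 * log 0 = 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy_bits(p, q):
    """Average code length when events follow p but the code was built for q."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # events that never occur cost nothing
        if qx == 0:
            return math.inf   # p emits an event that q's code cannot encode
        total -= px * math.log2(qx)
    return total

def kl_bits(p, q):
    """KL(p || q) in bits: the expectation under p of log2(p/q)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # q's value here never enters the sum
        if qx == 0:
            return math.inf   # the log-ratio explodes
        total += px * math.log2(px / qx)
    return total

# Illustrative distributions (assumed for this example only).
p = [0.5, 0.5, 0.0]
q_wasteful = [0.25, 0.25, 0.5]    # spends mass on an event p never emits
q_starved = [0.998, 0.001, 0.001]  # nearly no mass on an event p emits half the time

for name, q in [("wasteful", q_wasteful), ("starved", q_starved)]:
    overspend = cross_entropy_bits(p, q) - entropy_bits(p)
    print(name, "KL:", kl_bits(p, q), "overspend:", overspend)
```

In both cases the printed overspend (cross-entropy minus H(p)) agrees with the KL value up to floating-point error, and the starved code pays far more than the wasteful one.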
log is concave, so Jensen's inequality says E[log Z] ≤ log E[Z]. Apply it to Z = q/p under p:
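Spelled out for discrete distributions, the step reads as follows. This is a sketch: restricting the sum to the support of p is an assumption about the setting rather than something stated above.

```latex
-\mathrm{KL}(p \,\|\, q)
  = \mathbb{E}_{p}\!\left[\log \frac{q(x)}{p(x)}\right]
  \le \log \mathbb{E}_{p}\!\left[\frac{q(x)}{p(x)}\right]
  = \log \sum_{x:\,p(x)>0} p(x)\,\frac{q(x)}{p(x)}
  = \log \sum_{x:\,p(x)>0} q(x)
  \le \log 1 = 0 .
```

Hence KL(p ∥ q) ≥ 0. Because log is strictly concave, equality in the Jensen step requires q/p to be constant on the support of p, and equality in the last step requires q to put all its mass there, which together force q = p.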
So KL behaves like a distance in one respect — zero exactly at identity, positive otherwise — and we are tempted to treat it as one. It is not one, and the failure is instructive rather than embarrassing.
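KL(p ∥ q) ≠ KL(q ∥ p) in general, and the direction you optimise decides what kind of approximation you get. Fit a single Gaussian q to a bimodal target p and the two directions give two different answers, both correct by their own lights. Minimising the forward direction KL(p ∥ q) yields a broad Gaussian that covers both modes (mean-seeking); minimising the reverse direction KL(q ∥ p) yields a narrow Gaussian locked onto a single mode (mode-seeking):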
One bimodal target, one Gaussian budget, two KL directions, two philosophies of approximation.
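The two fits can be reproduced numerically. The sketch below is one possible setup, not the construction behind the figure: the target (an equal mixture of N(−3, 1) and N(+3, 1)), the integration grid, and the brute-force search over (μ, σ) are all assumptions chosen for illustration. The forward fit relies on the standard fact that minimising KL(p ∥ q) over Gaussians amounts to matching the mean and variance of p.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def log_p(x):
    """Log density of the assumed bimodal target: 0.5 N(-3, 1) + 0.5 N(+3, 1)."""
    a = math.log(0.5) + log_normal_pdf(x, -3.0, 1.0)
    b = math.log(0.5) + log_normal_pdf(x, 3.0, 1.0)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp

# Integration grid on [-12, 12] (an illustrative choice).
DX = 0.05
XS = [-12.0 + DX * i for i in range(481)]
LOG_P = [log_p(x) for x in XS]
P = [math.exp(lp) for lp in LOG_P]

def fit_forward():
    """Minimise KL(p || q) over Gaussians by matching p's mean and variance."""
    mean = sum(x * px for x, px in zip(XS, P)) * DX
    var = sum((x - mean) ** 2 * px for x, px in zip(XS, P)) * DX
    return mean, math.sqrt(var)

def reverse_kl(mu, sigma):
    """Riemann-sum estimate of KL(q || p) for q = N(mu, sigma^2)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal_pdf(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return total * DX

def fit_reverse():
    """Minimise KL(q || p) by brute-force search over a coarse (mu, sigma) grid."""
    candidates = [(-5.0 + 0.25 * i, 0.5 + 0.25 * j)
                  for i in range(41) for j in range(15)]
    return min(candidates, key=lambda ms: reverse_kl(*ms))

mu_f, sigma_f = fit_forward()
mu_r, sigma_r = fit_reverse()
print("forward KL fit:", mu_f, sigma_f)   # broad, centred between the modes
print("reverse KL fit:", mu_r, sigma_r)   # narrow, sitting on one mode
```

Because the target is symmetric, the reverse fit has two equally good answers, one per mode; which one the search returns is an accident of floating-point ordering.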
Three objectives that sound different and are the same optimisation. With p the data distribution fixed and qθ the model:
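One way to write the chain, as a sketch: H(p, q_θ) here denotes the cross-entropy −E_p[log q_θ(x)], a notation assumed for this block, and H(p) drops out because it does not depend on θ.

```latex
\arg\min_{\theta}\, \mathrm{KL}(p \,\|\, q_\theta)
  = \arg\min_{\theta}\, \bigl[ H(p, q_\theta) - H(p) \bigr]
  = \arg\min_{\theta}\, H(p, q_\theta)
  = \arg\max_{\theta}\, \mathbb{E}_{x \sim p}\bigl[ \log q_\theta(x) \bigr]
```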
And the rightmost term, the expected log-probability of the data under the model, is estimated from samples by the average log-likelihood of the training set (the log-likelihood divided by the number of examples, which does not move the optimum). Minimising cross-entropy, minimising forward KL, and maximum likelihood are one procedure wearing three names, which is why classifiers inherit forward KL's mean-seeking generosity toward covering all the data.
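The equivalence can be checked on a toy dataset. In this sketch the training set, the model probabilities, and all function names are hypothetical; p is taken to be the empirical distribution of the samples.

```python
import math
from collections import Counter

def empirical_distribution(samples):
    """Relative frequency of each symbol in the training set."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

def average_nll(samples, q):
    """Negative log-likelihood of the samples under q, divided by N (nats)."""
    return -sum(math.log(q[x]) for x in samples) / len(samples)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(px * math.log(q[x]) for x, px in p.items() if px > 0)

def kl(p, q):
    """KL(p || q) in nats."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

samples = list("aaaabbbc")                  # hypothetical training set
p_hat = empirical_distribution(samples)
q_model = {"a": 0.4, "b": 0.4, "c": 0.2}    # hypothetical model

nll = average_nll(samples, q_model)
ce = cross_entropy(p_hat, q_model)
entropy = cross_entropy(p_hat, p_hat)       # H(p) is H(p, p)
gap = kl(p_hat, q_model)

print("average NLL      :", nll)
print("cross-entropy    :", ce)
print("entropy + KL     :", entropy + gap)
```

All three printed numbers agree up to floating-point error, so lowering any one of them lowers the others; only the constant H(p) separates the cross-entropy from the KL term.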
| System | Term | What the KL is doing |
|---|---|---|
| Any classifier / LM | cross-entropy loss | Forward KL to the data distribution, via the trinity above. |
| VAE | KL(q(z\|x) ∥ p(z)) | Keeps the encoder's posterior pinned near the prior so the latent space stays usable. |
| RLHF / PPO | KL(π ∥ πref) penalty | A leash: lets the policy chase reward only as far as it stays probabilistically close to the reference model. |
| Distillation | KL(teacher ∥ student) | The student matches the teacher's full soft distribution, not just its argmax — the dark knowledge is in the ratios. |
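Three of the rows above reduce to a few lines each. The sketch below makes assumptions the table does not state: a diagonal-Gaussian encoder with a standard-normal prior for the VAE term, a simple sample average of log-probability differences for the RLHF penalty, and plain softmax over logits (no temperature) for distillation. All function and parameter names are hypothetical.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over dimensions.
    Assumes a diagonal-Gaussian posterior and a standard-normal prior."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def kl_penalty_estimate(policy_logprobs, ref_logprobs):
    """Monte Carlo estimate of KL(pi || pi_ref) from actions sampled under pi:
    the mean of log pi(a) - log pi_ref(a) over those samples."""
    diffs = [lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs)]
    return sum(diffs) / len(diffs)

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) over the full soft distributions, in nats."""
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)

# Illustrative inputs only.
print(kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
print(kl_penalty_estimate([-1.0, -2.0], [-1.5, -2.5]))
print(kl_teacher_student([2.0, 1.0, 0.1], [1.0, 1.0, 1.0]))
```

The distillation example has a teacher and student that agree on the argmax only by tie-breaking, yet the KL is clearly positive: the penalty comes from the mismatch in the ratios between classes, not from the top label.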