Surprise, averaged — and why your training loss is a code length
Start with one event of probability p. We want a number s(p) measuring how surprising it is when it happens. Three requirements pin the answer down almost uniquely:
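Requirement 3 is the decisive one: independent events multiply their probabilities, so asking their surprises to add means s(p₁p₂) = s(p₁) + s(p₂), and a function turning products into sums is a logarithm. Taking base 2, so that the unit is the bit:

$$
s(p) = -\log_2 p
$$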
A fair-coin head is 1 bit of surprise. A 1-in-1024 event is 10 bits. Doubling the rarity adds one bit; surprise is the receipt randomness hands you, denominated in logs.
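The numbers in this paragraph are easy to check directly. Below is a minimal sketch, assuming surprise is measured in bits (log base 2); the function name `surprisal_bits` is made up for illustration.

```python
import math

def surprisal_bits(p: float) -> float:
    """Surprise of an event with probability p, in bits: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

print(surprisal_bits(0.5))       # fair-coin head -> 1.0
print(surprisal_bits(1 / 1024))  # 1-in-1024 event -> 10.0

# Independent events: probabilities multiply, surprises add.
joint = surprisal_bits(0.5 * 0.25)
separate = surprisal_bits(0.5) + surprisal_bits(0.25)
print(joint, separate)
```

Halving the probability always adds exactly one bit, which is the "doubling the rarity" rule in code form.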
A distribution produces events all the time, each with its own surprise. Average them under the distribution itself (an expectation, like everything else):

$$
H(p) = \mathbb{E}_{x \sim p}\left[-\log p(x)\right] = -\sum_x p(x) \log p(x)
$$
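One way to see that entropy really is "surprise, averaged" is to sample events from a distribution, average their individual surprises, and compare with the closed-form sum. The sketch below assumes a small made-up four-symbol source and uses bits throughout; the names `entropy_bits`, `average_surprise_bits`, and `source` are illustrative only.

```python
import math
import random

def entropy_bits(p):
    """Exact entropy in bits of a distribution given as {symbol: probability}."""
    # Zero-probability symbols contribute nothing to the sum.
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

def average_surprise_bits(p, n_samples, seed=0):
    """Monte Carlo estimate: draw events from p and average their surprise."""
    rng = random.Random(seed)
    symbols = list(p)
    weights = [p[s] for s in symbols]
    draws = rng.choices(symbols, weights=weights, k=n_samples)
    return sum(-math.log2(p[x]) for x in draws) / n_samples

# Hypothetical source, chosen so the arithmetic is easy to follow.
source = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

exact = entropy_bits(source)
estimate = average_surprise_bits(source, n_samples=100_000)
print(f"exact H(p) = {exact:.3f} bits, sampled average = {estimate:.3f} bits")
```

The sampled average wanders around the exact value and tightens as the sample count grows, which is all an expectation promises.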
Shannon's theorem gives the operational meaning in one line: H(p) is the minimum average number of bits per symbol that any lossless code can achieve for data drawn from p. Frequent symbols get short codewords, rare ones get long codewords (ideally about −log p(x) bits each, with the log taken base 2), and no cleverness beats the average. Entropy is not a metaphor for information; it is the irreducible invoice for transmitting it.
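To make the coding claim concrete, here is a rough sketch that builds a binary Huffman code (a standard optimal prefix code, swapped in here for illustration, not something the text above prescribes) and compares its average codeword length with H(p). The two example sources are made up, and the helper assumes at least two symbols.

```python
import heapq
import math

def entropy_bits(p):
    """Entropy in bits of a distribution given as {symbol: probability}."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

def huffman_code_lengths(p):
    """Codeword length per symbol for a binary Huffman code (needs >= 2 symbols)."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(px, i, [sym]) for i, (sym, px) in enumerate(p.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in p}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees puts every symbol in them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

def average_length(p, lengths):
    """Expected codeword length in bits per symbol under p."""
    return sum(p[sym] * lengths[sym] for sym in p)

# Probabilities that are powers of 1/2: lengths equal -log2 p(x) exactly.
dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
dyadic_lengths = huffman_code_lengths(dyadic)
print(dyadic_lengths, average_length(dyadic, dyadic_lengths), entropy_bits(dyadic))

# Other probabilities: whole-bit codewords cost a little more than H(p).
skewed = {"a": 0.6, "b": 0.3, "c": 0.1}
skewed_lengths = huffman_code_lengths(skewed)
print(skewed_lengths, average_length(skewed, skewed_lengths), entropy_bits(skewed))
```

In the first case the average length meets the entropy exactly; in the second it sits above it but by less than one bit, because codewords must be a whole number of bits long.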
A deterministic source — one outcome, probability 1 — has H = 0: nothing to say, nothing to transmit. At the other end, the uniform distribution over K outcomes maximises entropy at log K: every guess is as bad as every other, maximal ignorance. Everything interesting lives between.
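The two extremes can be checked numerically. This is a small sketch with K = 8 outcomes, so the uniform ceiling is log₂ 8 = 3 bits; the `skewed` distribution is an arbitrary made-up example of something in between.

```python
import math

def entropy_bits(p):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return sum(-px * math.log2(px) for px in p if px > 0)

K = 8
deterministic = [1.0] + [0.0] * (K - 1)   # one outcome, probability 1
uniform = [1.0 / K] * K                   # every outcome equally likely
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02]  # arbitrary in-between case

for name, dist in [("deterministic", deterministic),
                   ("skewed", skewed),
                   ("uniform", uniform)]:
    print(f"{name:>13}: H = {entropy_bits(dist):.3f} bits (ceiling {math.log2(K):.3f})")
```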
Binary entropy. Certain at either end (H = 0), maximally uncertain at p = 0.5 (H = 1 bit).
This curve is why label smoothing, exploration bonuses, and entropy regularisation all reach for the same dial: pushing a policy or a softmax toward the top of the curve keeps options open, pushing toward the ends commits.
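The binary entropy curve referred to above can be tabulated in a few lines. This is only a sketch: `binary_entropy_bits` is an illustrative name, and the grid of p values is arbitrary.

```python
import math

def binary_entropy_bits(p: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy_bits(p):.3f} bits")
```

The table is symmetric around p = 0.5 and falls to zero at both ends, so nudging a prediction toward 0.5 raises its entropy and nudging it toward 0 or 1 lowers it.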
Suppose data comes from p but you built your code (your model) for q. You pay q's codeword lengths, −log q(x), at p's frequencies, and the resulting average is the cross-entropy:

$$
H(p, q) = -\sum_x p(x) \log q(x)
$$
This is exactly the loss a classifier trains on: the labels are samples from p (usually one-hot), the softmax output is q, and minimising −log q(label) is minimising the average code length your model assigns the truth. Since H(p, q) = H(p) + KL(p ∥ q) and H(p) is fixed by the data, minimising cross-entropy is minimising the KL divergence: the wrong-book penalty is the only part you can reduce.
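A short numerical check of both claims, as a sketch: the distributions `p`, `q`, and `one_hot` are made up, everything is in bits, and the helpers assume q(x) > 0 wherever p(x) > 0. Deep-learning frameworks typically use the natural log instead, which changes the unit to nats but not the minimiser.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return sum(-px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: q's code lengths averaged at p's frequencies."""
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits: the extra cost of coding p with q's code."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical data distribution p and model distribution q.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(f"H(p, q)         = {cross_entropy(p, q):.4f}")
print(f"H(p) + KL(p||q) = {entropy(p) + kl_divergence(p, q):.4f}")

# One-hot label on class 1: H(p) = 0, so the loss reduces to -log q(label).
one_hot = [0.0, 1.0, 0.0]
print(f"one-hot loss    = {cross_entropy(one_hot, q):.4f}")
print(f"-log2 q(label)  = {-math.log2(q[1]):.4f}")
```

With a one-hot target the entropy term vanishes, so the per-sample loss is the KL divergence from the label to the model's prediction.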
| Quantity | Question it answers | Lives in |
|---|---|---|
| −log p(x) | How surprising was this one event? | Per-sample loss values |
| H(p) | How unpredictable is this source, irreducibly? | Noise floor; best achievable loss |
| H(p, q) | What do I pay using model q on data p? | The training objective |
| KL(p ∥ q) | How much of that payment was avoidable? | Its own note |