Entropy

Surprise, averaged — and why your training loss is a code length

01 · First principles: How surprised should an event make you?

Start with one event of probability p. We want a number s(p) measuring how surprising it is when it happens. Three requirements pin the answer down almost uniquely:

1. Certain events carry no surprise: s(1) = 0.
2. Rarer means more surprising: s decreases in p, blowing up as p → 0.
3. Independent surprises add: two unrelated events of probabilities p and q occur together with probability pq, and the joint surprise should be s(p) + s(q).

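Requirement 3 is the decisive one: the only continuous functions that turn products into sums, s(pq) = s(p) + s(q), are multiples of the logarithm, s(p) = c log p. Requirement 2 forces c to be negative, and the size of c is just the choice of base, that is, of units. So: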

s(p) = −log p — base 2 gives bits, base e gives nats

A fair-coin head is 1 bit of surprise. A 1-in-1024 event is 10 bits. Halving the probability adds one bit; surprise is the receipt randomness hands you, denominated in logs.
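The numbers above can be checked directly. This is a minimal sketch using only the standard library; the function name `surprisal_bits` is a hypothetical helper introduced for illustration, not something the note defines.

```python
import math

def surprisal_bits(p: float) -> float:
    """Surprise of an event with probability p, in bits: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

coin = surprisal_bits(0.5)        # fair-coin head
rare = surprisal_bits(1 / 1024)   # 1-in-1024 event

# Independent events: probabilities multiply, surprises add.
joint = surprisal_bits(0.5 * 0.25)
parts = surprisal_bits(0.5) + surprisal_bits(0.25)

print(coin, rare, joint, parts)
```

Swapping `math.log2` for `math.log` would give the same quantities in nats instead of bits.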

02 · The definition: Entropy = expected surprise

A distribution produces events all the time, each with its own surprise. Average them under the distribution itself (an expectation, like everything else):

H(p) = E_x∼p[−log p(x)] = −Σ_x p(x) log p(x)

how surprising this source is, on average

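Shannon's source coding theorem gives the operational meaning in one line: with logs taken base 2, H(p) is the minimum average number of bits per symbol that any lossless code can achieve for symbols drawn independently from p. Frequent symbols get short codewords and rare ones get long codewords, ideally of length −log p(x) each. Real codewords have whole-number lengths, so a symbol-by-symbol code can land up to one bit above H(p), and coding long blocks closes that gap, but no cleverness beats the average. Entropy is not a metaphor for information; it is the irreducible invoice for transmitting it.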
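One way to see the code-length reading concretely is to build an actual prefix code and compare its average length with H(p). The sketch below uses Huffman coding, which is a choice made here for illustration rather than anything the note prescribes; the helper names and the two example distributions are hypothetical.

```python
import heapq
import math

def entropy_bits(probs):
    """H(p) = -sum p(x) log2 p(x); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Codeword length of each symbol under a binary Huffman code."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # merged symbols sit one level deeper in the tree
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

def average_length(probs):
    return sum(p * l for p, l in zip(probs, huffman_lengths(probs)))

dyadic = [0.5, 0.25, 0.125, 0.125]   # probabilities are powers of 1/2
skewed = [0.7, 0.2, 0.1]             # -log2 p(x) is not a whole number here

for probs in (dyadic, skewed):
    print(huffman_lengths(probs), average_length(probs), entropy_bits(probs))
```

For the dyadic distribution the codeword lengths equal −log₂ p(x) exactly and the average length matches H(p). For the skewed one the average sits above H(p) but within one bit of it, which is the rounding cost of whole-number codeword lengths.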

03 · The extremes: Uniform maximises, deterministic zeroes

A deterministic source — one outcome, probability 1 — has H = 0: nothing to say, nothing to transmit. At the other end, the uniform distribution over K outcomes maximises entropy at log K: every guess is as bad as every other, maximal ignorance. Everything interesting lives between.

Binary entropy. Certain at either end (H = 0), maximally uncertain at p = 0.5 (H = 1 bit).
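Both extremes, and the shape of the binary entropy curve, can be checked numerically. The sketch below assumes entropy in bits; `K = 8` and the skewed distribution are arbitrary illustrative choices.

```python
import math

def entropy_bits(probs):
    """H(p) in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy_bits(p):
    """Entropy of a coin that lands heads with probability p."""
    return entropy_bits([p, 1.0 - p])

K = 8
uniform = [1.0 / K] * K
skewed = [0.65] + [0.05] * (K - 1)      # same support, one favoured outcome
deterministic = [1.0] + [0.0] * (K - 1)

h_uniform = entropy_bits(uniform)        # log2(K)
h_skewed = entropy_bits(skewed)          # strictly between 0 and log2(K)
h_deterministic = entropy_bits(deterministic)

# Sample the binary entropy curve at a few points.
curve = {p: binary_entropy_bits(p) for p in (0.0, 0.1, 0.5, 0.9, 1.0)}
print(h_uniform, h_skewed, h_deterministic, curve)
```

The curve is symmetric about p = 0.5, so a coin with bias 0.1 is exactly as unpredictable as one with bias 0.9.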

This curve is why label smoothing, exploration bonuses, and entropy regularisation all reach for the same dial: pushing a policy or a softmax toward the top of the curve keeps options open, pushing toward the ends commits.

04 · The wrong codebook: Cross-entropy is your training loss

Suppose data comes from p but you built your code (your model) for q. You pay q's codeword lengths, −log q(x), at p's frequencies:

H(p, q) = E_x∼p[−log q(x)] = H(p) + KL(p ∥ q)

H(p): unavoidable · KL(p ∥ q): penalty for the wrong book
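The decomposition of cross-entropy into entropy plus KL divergence can be verified on any pair of distributions. The sketch below works in nats and assumes q(x) > 0 wherever p(x) > 0; the two distributions are made-up examples.

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): q's code lengths averaged at p's frequencies."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q): the extra cost of using q's codebook on p's data."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # hypothetical data distribution
q = [0.4, 0.4, 0.2]   # hypothetical model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs, kl_divergence(p, q))
```

Setting `q = p` makes the KL term vanish, so the cross-entropy falls to its floor H(p).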

This is exactly the loss a classifier trains on: the labels are samples from p (usually one-hot), the softmax output is q, and minimising −log q(label) is minimising the average code length your model assigns the truth. Since H(p) is fixed by the data, minimising cross-entropy is minimising the KL divergence — the wrong-book penalty is the only part you can reduce.
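For a single training example, the classifier loss described above reduces to reading off one log-probability. The sketch below is written in plain Python rather than any particular framework; the logits and label are invented values, and `cross_entropy_loss` is a hypothetical helper name.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy_loss(logits, label):
    """-log q(label), where q is the softmax of the logits."""
    return -log_softmax(logits)[label]

logits = [2.0, 0.5, -1.0]   # hypothetical model outputs for 3 classes
label = 0                   # hypothetical true class

loss = cross_entropy_loss(logits, label)

# Same value from the full definition with a one-hot p:
one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]
log_q = log_softmax(logits)
full_sum = -sum(pk * lq for pk, lq in zip(one_hot, log_q))
print(loss, full_sum)
```

With a one-hot p the sum over classes collapses to a single term, and H(p) = 0, so for that example the cross-entropy and the KL divergence coincide.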

Perplexity, the language-modelling metric, is just the exponentiated per-token cross-entropy: exp(H(p, q)) with H in nats, or equivalently 2^H(p, q) with H in bits. It is the effective number of equally likely choices the model is still hedging between. A perplexity of 20 means the model is, on average, as uncertain as a roll of a fair 20-sided die.
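In practice the expectation under p is estimated from held-out tokens, so perplexity becomes the exponential of the mean negative log-probability the model assigned to the tokens that actually occurred. The sketch below assumes those per-token probabilities are already available as a list; the values are illustrative only.

```python
import math

def perplexity(token_probs):
    """exp(mean of -ln q(token)) over the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def perplexity_from_bits(token_probs):
    """Same quantity computed as 2 ** (mean of -log2 q(token))."""
    bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** bits

# A model that always spreads its mass evenly over 20 options.
die = perplexity([1 / 20] * 50)

# A model with varying confidence across four observed tokens.
probs = [0.5, 0.25, 0.125, 0.125]
mixed = perplexity(probs)
mixed_bits = perplexity_from_bits(probs)
print(die, mixed, mixed_bits)
```

The two functions agree because the change of logarithm base cancels against the change of exponent base, so perplexity does not depend on whether the loss is logged in nats or bits.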

05 · Orientation: Where each quantity points

Quantity	Question it answers	Lives in
−log p(x)	How surprising was this one event?	Per-sample loss values
H(p)	How unpredictable is this source, irreducibly?	Noise floor; best achievable loss
H(p, q)	What do I pay using model q on data p?	The training objective
KL(p ∥ q)	How much of that payment was avoidable?	Its own note

Mental Model

Surprise = −log p: zero when certain, unbounded when rare, additive when independent.
Entropy = expected surprise = the shortest possible average code length (Shannon).
Uniform is maximal ignorance (log K); deterministic is silence (0).
Cross-entropy = coding p's data with q's codebook; the excess over H(p) is KL.
Training a classifier is compressing labels with the model's codebook and shortening the invoice.