Flipping the conditional you have into the one you want
01 · First principles
We always have the wrong conditional
Nature hands us probabilities in one direction and we need them in the other. A test's datasheet gives P(positive | disease); the patient wants P(disease | positive). A model gives P(data | parameters); we want P(parameters | data). These are different numbers — sometimes wildly different — and confusing them has a name (the prosecutor's fallacy) and a body count of bad decisions.
Bayes' theorem is, in its entirety, a flipping device: a rule for converting the conditional you can measure into the conditional you actually care about. The conversion fee is paid in prior knowledge.
02 · The derivation
Two lines from the definition
Conditional probability is defined as P(A | B) = P(A ∩ B) / P(B). Write the joint both ways and equate:
P(A ∩ B) = P(A | B) · P(B) = P(B | A) · P(A)
P(A | B) = P(B | A) · P(A) / P(B)
That is the entire proof — the theorem is a rearrangement of a definition, which is why it is beyond dispute. Everything contentious about Bayesian statistics lives in what you plug in, never in the plumbing.
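As a sanity check on the rearrangement, here is a minimal sketch that verifies the identity on a small joint distribution using exact fractions. The joint table is made up purely for illustration; any four non-negative numbers summing to one would do.

```python
from fractions import Fraction

# Hypothetical joint distribution over two binary events A and B.
# Keys are (a, b); the four probabilities sum to 1.
joint = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(2, 10),
    (False, True): Fraction(3, 10),
    (False, False): Fraction(4, 10),
}

# Marginals, obtained by summing the joint over the other event.
p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)
p_ab = joint[(True, True)]

# Both conditionals, straight from the definition.
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

# Bayes' theorem: flip P(B | A) into P(A | B).
bayes = p_b_given_a * p_a / p_b

assert bayes == p_a_given_b
print(p_b_given_a, p_a_given_b)  # 1/3 1/4
```

The two conditionals differ (1/3 versus 1/4) even in this tiny table, and the theorem converts one into the other exactly.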
03 · The anatomy
Prior × likelihood, normalised
Rename for the inference setting — hypothesis h, observed data D:
P(h | D) = P(D | h) · P(h) / P(D)
Here P(D | h) is the likelihood, P(h) is the prior, and P(D) is the evidence.
The prior is what you believed before seeing data; the likelihood is how loudly the data argues for each hypothesis; the posterior is the updated belief. The evidence P(D) = Σ_h P(D | h) · P(h) is a sum over all hypotheses, so it is the same number for every h: it is the normaliser that makes the posterior sum to one, which is why ML practice almost always writes:
posterior ∝ likelihood × prior
The hypothesis ranking never needs the denominator. (When you do need P(D) — model comparison, marginal likelihoods — it is usually the hard part.)
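A minimal sketch of that point, assuming a toy problem with three candidate coin biases and an observation of 7 heads in 10 flips (the hypotheses, prior weights, and data are all made up for illustration). The ranking is read off the unnormalised products; dividing by the evidence changes the scale, not the order.

```python
# Hypothetical discrete hypothesis space: three candidate values of P(heads).
prior = {0.3: 0.25, 0.5: 0.50, 0.7: 0.25}

# Observed data D: 7 heads and 3 tails. The binomial coefficient is the same
# for every hypothesis, so it is dropped from the likelihood.
heads, tails = 7, 3
likelihood = {h: h**heads * (1 - h) ** tails for h in prior}

# posterior ∝ likelihood × prior
unnormalised = {h: likelihood[h] * prior[h] for h in prior}

# The evidence P(D) is the sum of those products over all hypotheses.
evidence = sum(unnormalised.values())
posterior = {h: u / evidence for h, u in unnormalised.items()}

best_without_evidence = max(unnormalised, key=unnormalised.get)
best_with_evidence = max(posterior, key=posterior.get)

print(best_without_evidence, best_with_evidence)  # 0.7 0.7
```

With only three hypotheses the evidence is a trivial sum. In a continuous parameter space the same quantity becomes an integral, which is where the difficulty mentioned above comes from.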
04 · The trap
The base rate eats the test
The canonical numbers everyone gets wrong. A disease affects 1 in 1,000 people. The test catches 99% of cases, with a 5% false-positive rate. You test positive. How likely is it that you have the disease? Most people (including, in published studies, most physicians) answer "about 95–99%". Run the machine:
P(disease | positive) = (0.99 × 0.001) / (0.99 × 0.001 + 0.05 × 0.999) = 0.00099 / 0.05094 ≈ 0.019
The answer is about 2%. The 5% false-positive rate, applied to 999 healthy people, produces roughly 50 false alarms for every true case.
The intuition once you see it: a positive result puts you in a room of 51 positive-testers, and 50 of them are healthy. The test is good; the disease is just so rare that false alarms from the enormous healthy majority swamp the single true case. A likelihood means nothing until multiplied by its base rate.
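The same arithmetic as a short sketch, so the numbers can be varied. The function name and parameter names are illustrative choices, not taken from any library.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive) for a binary test, via Bayes' theorem."""
    true_positives = sensitivity * prevalence                  # P(+ | disease) P(disease)
    false_positives = false_positive_rate * (1 - prevalence)   # P(+ | healthy) P(healthy)
    return true_positives / (true_positives + false_positives)


# The numbers from the text: 1 in 1,000 prevalence, 99% sensitivity, 5% false positives.
rare = posterior_given_positive(0.001, 0.99, 0.05)
print(f"{rare:.4f}")  # 0.0194

# Same test, but a condition affecting 1 in 10 people.
common = posterior_given_positive(0.1, 0.99, 0.05)
print(f"{common:.4f}")  # 0.6875
```

Nothing about the test changed between the two calls. Only the base rate moved, and the posterior went from about 2% to about 69%.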
05 · In ML
The theorem at work
MAP estimation: maximise posterior ∝ likelihood × prior; in log space the negative log-prior surfaces as your regulariser (a Gaussian prior gives an L2 penalty, a Laplace prior an L1 penalty). MLE is the special case of a flat prior.
Naive Bayes: apply the theorem with a deliberately false assumption (features independent given the class) so the likelihood factorises into cheap per-feature terms. Wrong model, surprisingly durable classifier.
Iterated updating: today's posterior is tomorrow's prior. Bayesian inference is the same multiplication applied repeatedly as data streams in — beliefs as a running product, renormalised. This is the cleanest mental model for filtering (Kalman and friends) and for online learning generally.
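A minimal sketch of iterated updating, assuming a toy setting with three candidate coin biases and a made-up sequence of flips. It updates one observation at a time, feeding each posterior back in as the next prior, and checks that the result matches a single batch update on all the data.

```python
def update(belief, flip):
    """One Bayes step: multiply by the likelihood of a single flip, renormalise."""
    unnormalised = {h: p * (h if flip else 1 - h) for h, p in belief.items()}
    total = sum(unnormalised.values())
    return {h: u / total for h, u in unnormalised.items()}


# Hypothetical prior over P(heads) and a hypothetical data stream (1 = heads).
prior = {0.3: 0.25, 0.5: 0.50, 0.7: 0.25}
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Streaming: today's posterior is tomorrow's prior.
belief = dict(prior)
for flip in flips:
    belief = update(belief, flip)

# Batch: apply the full likelihood of all flips to the original prior at once.
heads = sum(flips)
tails = len(flips) - heads
unnormalised = {h: p * h**heads * (1 - h) ** tails for h, p in prior.items()}
total = sum(unnormalised.values())
batch = {h: u / total for h, u in unnormalised.items()}

# Both routes give the same posterior, up to floating-point rounding.
assert all(abs(belief[h] - batch[h]) < 1e-12 for h in prior)
```

The agreement relies on the flips being conditionally independent given the hypothesis, which is what lets the likelihood factor into a running product.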
Habit worth keeping: whenever a model emits a confident P(class | input), ask what base rates the training data baked into the prior. A classifier trained on balanced data and deployed on imbalanced reality is the disease-test trap, automated — and the same arithmetic governs precision under class imbalance.
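One common way to act on that habit is to swap the training prior for the deployment prior after the fact. The sketch below assumes that only the class frequencies differ between training and deployment, while P(input | class) stays the same; that is an assumption to be checked, not a given. The function name, class labels, and numbers are all hypothetical.

```python
def reweight_posterior(p_train, train_prior, deploy_prior):
    """Re-express a classifier's P(class | input) under a different class prior.

    Dividing by the training prior recovers a quantity proportional to the
    likelihood; multiplying by the deployment prior and renormalising gives
    the posterior under the new base rates.
    """
    unnormalised = {
        c: p_train[c] * deploy_prior[c] / train_prior[c] for c in p_train
    }
    total = sum(unnormalised.values())
    return {c: u / total for c, u in unnormalised.items()}


# A classifier trained on balanced data reports 95% confidence in "disease".
p_train = {"disease": 0.95, "healthy": 0.05}
train_prior = {"disease": 0.5, "healthy": 0.5}

# In deployment the condition affects 1 in 1,000 people.
deploy_prior = {"disease": 0.001, "healthy": 0.999}

adjusted = reweight_posterior(p_train, train_prior, deploy_prior)
print(f"{adjusted['disease']:.3f}")  # 0.019
```

The confident 95% collapses to about 2% once the real base rate is restored, which is the disease-test arithmetic again.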
Mental Model
Bayes is a flipping device: P(B|A) in, P(A|B) out, priced by the prior.
Two lines from the definition of conditional probability — the math is never the controversial part.
Posterior ∝ likelihood × prior; the evidence is bookkeeping that cancels in any argmax.
Rare condition + decent test = mostly false alarms. Always multiply by the base rate.
Inference is updating: posterior today, prior tomorrow.