
Bayes' Theorem

Flipping the conditional you have into the one you want

01 · First principles

We always have the wrong conditional

Nature hands us probabilities in one direction and we need them in the other. A test's datasheet gives P(positive | disease); the patient wants P(disease | positive). A model gives P(data | parameters); we want P(parameters | data). These are different numbers — sometimes wildly different — and confusing them has a name (the prosecutor's fallacy) and a body count of bad decisions.

The whole theorem is a flipping device: a rule for converting the conditional you can measure into the conditional you actually care about. The conversion fee is paid in prior knowledge.

02 · The derivation

Two lines from the definition

Conditional probability is defined as P(A | B) = P(A ∩ B) / P(B), for P(B) > 0. Write the joint both ways and equate:

P(A ∩ B) = P(A | B)·P(B) = P(B | A)·P(A)
⇒  P(A | B) = P(B | A)·P(A) / P(B)

That is the entire proof — the theorem is a rearrangement of a definition, which is why it is beyond dispute. Everything contentious about Bayesian statistics lives in what you plug in, never in the plumbing.
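The rearrangement can be checked numerically on any joint distribution. The sketch below uses a made-up 2×2 joint table (the probabilities are illustrative assumptions, not taken from the text) and exact fractions, so the two sides can be compared without rounding:

```python
from fractions import Fraction

# Hypothetical joint distribution over two binary events A and B.
# Keys are (a, b) outcomes; the four probabilities sum to 1.
joint = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(2, 10),
    (False, True): Fraction(3, 10),
    (False, False): Fraction(4, 10),
}

# Marginals, obtained by summing the joint over the other event.
p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)
p_ab = joint[(True, True)]  # P(A ∩ B)

# Both conditionals straight from the definition.
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

# Bayes' theorem: flip P(B | A) into P(A | B).
bayes_rhs = p_b_given_a * p_a / p_b
assert p_a_given_b == bayes_rhs
```

Any other table with non-zero marginals should pass the same check, since the identity holds by construction.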

03 · The anatomy

Prior × likelihood, normalised

Rename for the inference setting — hypothesis h, observed data D:

P(h | D) = P(D | h) · P(h) / P(D)
posterior = likelihood × prior / evidence

The prior is what you believed before seeing data; the likelihood is how loudly the data argues for each hypothesis; the posterior is the updated belief. The evidence P(D) = Σ_h P(D | h)·P(h) does not depend on h at all: it is the normaliser that makes the posterior sum to one, which is why working ML almost always writes:

posterior ∝ likelihood × prior

The hypothesis ranking never needs the denominator. (When you do need P(D) — model comparison, marginal likelihoods — it is usually the hard part.)
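For a finite hypothesis set, the normalisation is a single sum. The following is a minimal sketch under assumed inputs: three candidate coin biases, a hand-picked prior, and 8 heads in 10 flips are all invented for illustration, and `normalise` is a hypothetical helper name.

```python
from math import comb

def normalise(priors, likelihoods):
    """Return (posterior, evidence) for a finite set of hypotheses."""
    unnormalised = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(unnormalised)  # P(D) = sum over h of P(D | h) * P(h)
    posterior = [u / evidence for u in unnormalised]
    return posterior, evidence

# Assumed setup: three hypotheses about a coin's probability of heads.
biases = [0.3, 0.5, 0.8]
priors = [0.25, 0.50, 0.25]
heads, flips = 8, 10

# Binomial likelihood of the observed data under each hypothesis.
likelihoods = [
    comb(flips, heads) * b**heads * (1 - b) ** (flips - heads) for b in biases
]

posterior, evidence = normalise(priors, likelihoods)

# Ranking by likelihood * prior matches ranking by the normalised posterior,
# because dividing every score by the same positive P(D) preserves order.
scores = [lik * pri for lik, pri in zip(likelihoods, priors)]
best_by_score = max(range(len(biases)), key=scores.__getitem__)
best_by_posterior = max(range(len(biases)), key=posterior.__getitem__)
assert best_by_score == best_by_posterior
```

The sum is trivial here only because there are three hypotheses; with a continuous or high-dimensional hypothesis space it becomes an integral, which is where the difficulty mentioned above comes from.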

04 · The trap

The base rate eats the test

The canonical numbers everyone gets wrong: a disease affects 1 in 1,000 people. The test catches 99% of cases and has a 5% false-positive rate. You test positive. What is the probability that you actually have the disease? Most people, including most physicians in published studies, answer "about 95–99%". Run the machine:

P(+) = 0.99·0.001 + 0.05·0.999 = 0.00099 + 0.04995 ≈ 0.051
P(disease | +) = 0.00099 / 0.051 ≈ 0.019 — about 2%
[Figure: 1,000 people scaled to one bar: 1 true case (tests positive), ~50 false positives, ~949 test negative. Keeping only the ~51 positives: 1 sick, 50 healthy. Your positive is probably one of the healthy.]

The 5% false-positive rate, applied to 999 healthy people, produces about 50 false alarms for every true case.

The intuition once you see it: a positive result puts you in a room of 51 positive-testers, and 50 of them are healthy. The test is good; the disease is just so rare that false alarms from the enormous healthy majority swamp the single true case. A likelihood means nothing until multiplied by its base rate.
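The same arithmetic as a small function makes it easy to see how strongly the answer depends on the base rate. This is a sketch: the function name and the extra prevalence values in the sweep are illustrative assumptions, while the 1-in-1,000, 99%, and 5% figures come from the example above.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_positives = sensitivity * prevalence                  # P(+ | disease) * P(disease)
    false_positives = false_positive_rate * (1 - prevalence)   # P(+ | healthy) * P(healthy)
    return true_positives / (true_positives + false_positives)

# The numbers from the text: 1 in 1,000, 99% sensitivity, 5% false positives.
p = posterior_given_positive(0.001, 0.99, 0.05)
print(f"P(disease | +) = {p:.3f}")

# Same test, different base rates (the extra prevalences are assumptions).
for prevalence in (0.001, 0.01, 0.1, 0.5):
    result = posterior_given_positive(prevalence, 0.99, 0.05)
    print(f"prevalence {prevalence:>5}: P(disease | +) = {result:.3f}")
```

The test's own characteristics never change across the sweep; only the prior does, and the posterior moves from roughly 2% to above 95%.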

05 · In ML

The theorem at work

Habit worth keeping: whenever a model emits a confident P(class | input), ask what base rates the training data baked into the prior. A classifier trained on balanced data and deployed on imbalanced reality is the disease-test trap, automated — and the same arithmetic governs precision under class imbalance.
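One common way to act on this habit is to re-weight a classifier's output for the deployment base rate. The sketch below assumes a binary classifier whose probabilities are well calibrated for its training distribution, and assumes that only the class prior changes between training and deployment (P(input | class) stays fixed). The function name and the example numbers are hypothetical.

```python
def adjust_for_deploy_prior(p_train, train_prior, deploy_prior):
    """Re-weight P(class=1 | x) from the training prior to the deployment prior.

    Assumes only the class prior shifts; P(x | class) is taken to be unchanged.
    """
    # Divide out the old prior and multiply in the new one, per class.
    pos = p_train * deploy_prior / train_prior
    neg = (1 - p_train) * (1 - deploy_prior) / (1 - train_prior)
    return pos / (pos + neg)  # renormalise so the two classes sum to one

# A classifier trained on balanced data (prior 0.5) says 0.95 for the rare class.
# Deployed where the class occurs 1 time in 1,000, that confidence shrinks sharply.
adjusted = adjust_for_deploy_prior(p_train=0.95, train_prior=0.5, deploy_prior=0.001)
print(f"adjusted P(class | input) = {adjusted:.3f}")
```

If the calibration or fixed-likelihood assumptions do not hold, the correction is only approximate, and measuring precision directly on deployment-like data is the safer check.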