General ML

Bayes' Theorem

Flipping the conditional you have into the one you want

01 · First principles: We always have the wrong conditional

Nature hands us probabilities in one direction and we need them in the other. A test's datasheet gives P(positive | disease); the patient wants P(disease | positive). A model gives P(data | parameters); we want P(parameters | data). These are different numbers — sometimes wildly different — and confusing them has a name (the prosecutor's fallacy) and a body count of bad decisions.

The whole theorem is a flipping device: a rule for converting the conditional you can measure into the conditional you actually care about. The conversion fee is paid in prior knowledge.

02 · The derivation: Two lines from the definition

Conditional probability is defined as P(A | B) = P(A ∩ B) / P(B). Write the joint both ways and equate:

P(A ∩ B) = P(A | B)·P(B) = P(B | A)·P(A)
⇒ P(A | B) = P(B | A)·P(A) / P(B)
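
The rearrangement can be checked numerically. The sketch below uses a made-up joint distribution over two binary events (the four probabilities are illustrative assumptions, not from any dataset) and confirms that the value of P(A | B) computed from the definition matches the value recovered through the theorem.

```python
# Hypothetical joint distribution over two binary events A and B.
# Keys are (a, b) outcomes; the four probabilities sum to 1.
joint = {
    (True, True): 0.10,
    (True, False): 0.20,
    (False, True): 0.30,
    (False, False): 0.40,
}

# Marginals: sum the joint over the other variable.
p_a = sum(p for (a, _), p in joint.items() if a)   # P(A)
p_b = sum(p for (_, b), p in joint.items() if b)   # P(B)
p_ab = joint[(True, True)]                         # P(A ∩ B)

# The definition of conditional probability, in both directions.
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

# Bayes' theorem: recover P(A | B) from P(B | A), P(A), and P(B).
p_a_given_b_via_bayes = p_b_given_a * p_a / p_b

print(f"P(A | B) from the definition: {p_a_given_b:.4f}")
print(f"P(A | B) via Bayes:           {p_a_given_b_via_bayes:.4f}")
print(f"P(B | A), the other way:      {p_b_given_a:.4f}")
```

In this toy table the two conditionals differ (0.25 versus roughly 0.33), which is the gap the theorem bridges.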

That is the entire proof — the theorem is a rearrangement of a definition, which is why it is beyond dispute. Everything contentious about Bayesian statistics lives in what you plug in, never in the plumbing.

03 · The anatomy: Prior × likelihood, normalised

Rename for the inference setting — hypothesis h, observed data D:

P(h | D) = P(D | h) · P(h) / P(D)

posterior = likelihood × prior / evidence

The prior is what you believed before seeing data; the likelihood is how loudly the data argues for each hypothesis; the posterior is the updated belief. The evidence P(D) = Σ_h P(D|h)P(h) does not depend on h at all — it is the normaliser that makes the posterior sum to one, which is why working ML almost always writes:

posterior ∝ likelihood × prior

The hypothesis ranking never needs the denominator. (When you do need P(D) — model comparison, marginal likelihoods — it is usually the hard part.)
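
A minimal sketch of the proportionality, assuming a finite hypothesis set stored in dictionaries. The hypothesis names, the prior, and the likelihood values are invented for illustration (three candidate coins, one observed heads). It shows that dividing by the evidence rescales the scores without changing which hypothesis wins.

```python
def posterior(prior, likelihood):
    """Return (normalised posterior, evidence) over a finite hypothesis set.

    prior and likelihood are dicts keyed by hypothesis name.
    """
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalised.values())  # P(D) = sum over h of P(D|h) P(h)
    normalised = {h: u / evidence for h, u in unnormalised.items()}
    return normalised, evidence

# Hypothetical setup: three candidate coins, data D = "one flip came up heads".
prior = {"fair": 0.50, "heads_biased": 0.25, "tails_biased": 0.25}
likelihood = {"fair": 0.5, "heads_biased": 0.9, "tails_biased": 0.1}  # P(D | h)

post, evidence = posterior(prior, likelihood)

# Ranking by likelihood × prior alone, without ever computing P(D).
unnormalised = {h: likelihood[h] * prior[h] for h in prior}
best_without_evidence = max(unnormalised, key=unnormalised.get)
best_with_evidence = max(post, key=post.get)

print(post)
print(best_without_evidence == best_with_evidence)
```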

04 · The trap: The base rate eats the test

The canonical numbers everyone gets wrong. A disease affects 1 in 1,000 people. The test catches 99% of cases, with a 5% false-positive rate. You test positive. What is the probability that you actually have the disease? Most people (including, in published studies, most physicians) answer "about 95–99%". Run the machine:

P(+) = 0.99·0.001 + 0.05·0.999 = 0.00099 + 0.04995 ≈ 0.051
P(disease | +) = 0.00099 / 0.051 ≈ 0.019 — about 2%

Out of every 1,000 people tested, the 5% false-positive rate applied to the 999 healthy ones produces about 50 false alarms (0.05 × 999 ≈ 50) for the single true case.

The intuition once you see it: a positive result puts you in a room of 51 positive-testers, and 50 of them are healthy. The test is good; the disease is just so rare that false alarms from the enormous healthy majority swamp the single true case. A likelihood means nothing until multiplied by its base rate.
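
The arithmetic above can be wrapped in a small function to see how strongly the answer depends on the base rate. This is a sketch; the function and parameter names are chosen here for illustration, and the 10% prevalence in the second call is an assumed comparison value, not a number from the example.

```python
def p_disease_given_positive(prevalence, sensitivity, false_positive_rate):
    """Posterior probability of disease after one positive test result."""
    true_positives = sensitivity * prevalence                  # P(+ | disease) P(disease)
    false_positives = false_positive_rate * (1 - prevalence)   # P(+ | healthy) P(healthy)
    p_positive = true_positives + false_positives              # the evidence, P(+)
    return true_positives / p_positive

# The numbers from the text: 1 in 1,000 prevalence, 99% sensitivity, 5% false positives.
rare = p_disease_given_positive(0.001, 0.99, 0.05)

# Same test, assumed 10% prevalence (for example, a pre-screened population).
common = p_disease_given_positive(0.10, 0.99, 0.05)

print(f"prevalence 0.1%: P(disease | +) = {rare:.3f}")
print(f"prevalence 10%:  P(disease | +) = {common:.3f}")
```

The test's characteristics are identical in both calls; only the prior changes, and the posterior moves from about 2% to about 69%.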

05 · In ML: The theorem at work

MAP estimation: maximise posterior ∝ likelihood × prior; after taking logs, the log-prior surfaces as your regulariser (a Gaussian prior on the weights gives an L2 penalty, a Laplace prior gives L1). MLE is the special case of a flat prior.
Naive Bayes: apply the theorem with a deliberately false assumption (features independent given the class) so the likelihood factorises into cheap per-feature terms. Wrong model, surprisingly durable classifier.
Iterated updating: today's posterior is tomorrow's prior. Bayesian inference is the same multiplication applied repeatedly as data streams in — beliefs as a running product, renormalised. This is the cleanest mental model for filtering (Kalman and friends) and for online learning generally.
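
The iterated-updating item can be sketched on a small grid of hypotheses. Everything concrete here is an assumption made for illustration: the five candidate coin biases, the uniform starting prior, and the flip sequence. The point it demonstrates is that updating one observation at a time gives the same posterior as a single batch update on all the data.

```python
def update(belief, likelihood):
    """One Bayes step: multiply belief by likelihood, then renormalise."""
    unnormalised = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Hypothetical grid of candidate values for P(heads), with a uniform prior.
biases = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [1 / len(biases)] * len(biases)

# Assumed data stream: 1 = heads, 0 = tails.
flips = [1, 1, 0, 1, 1, 1, 0, 1]

# Sequential: yesterday's posterior is today's prior.
belief = prior
for flip in flips:
    likelihood = [b if flip == 1 else 1 - b for b in biases]
    belief = update(belief, likelihood)

# Batch: one update using the likelihood of the whole sequence.
heads = sum(flips)
tails = len(flips) - heads
batch_likelihood = [b**heads * (1 - b) ** tails for b in biases]
batch = update(prior, batch_likelihood)

map_bias = biases[belief.index(max(belief))]
print([round(p, 4) for p in belief])
print("most probable bias on the grid:", map_bias)
```

Because the flips are treated as independent given the bias, the order of the updates does not matter, which is what makes the running-product view safe for streaming data.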

Habit worth keeping: whenever a model emits a confident P(class | input), ask what base rates the training data baked into the prior. A classifier trained on balanced data and deployed on imbalanced reality is the disease-test trap, automated — and the same arithmetic governs precision under class imbalance.
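
One common way to act on this habit is to re-weight the classifier's output by the ratio of deployment to training base rates. The sketch below rests on two assumptions that should be checked in practice: the classifier's probabilities are calibrated on its training distribution, and only the class priors change between training and deployment (P(input | class) stays the same). The function name and all numbers are illustrative.

```python
def reweight_for_deployment(train_posterior, train_prior, deploy_prior):
    """Adjust P(class | input) from training base rates to deployment base rates.

    Assumes P(input | class) is unchanged, so the posterior only needs to be
    multiplied by deploy_prior / train_prior and renormalised.
    """
    unnormalised = {
        c: train_posterior[c] * deploy_prior[c] / train_prior[c]
        for c in train_posterior
    }
    total = sum(unnormalised.values())
    return {c: u / total for c, u in unnormalised.items()}

# Hypothetical classifier trained on balanced data, confident in "positive".
train_posterior = {"positive": 0.95, "negative": 0.05}
train_prior = {"positive": 0.5, "negative": 0.5}

# Assumed deployment reality: positives are 1 in 1,000.
deploy_prior = {"positive": 0.001, "negative": 0.999}

adjusted = reweight_for_deployment(train_posterior, train_prior, deploy_prior)
print(f"P(positive | input) in deployment: {adjusted['positive']:.4f}")
```

With these numbers a 95% score shrinks to under 2%, the same collapse as in the disease-test example.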

Mental Model

Bayes is a flipping device: P(B|A) in, P(A|B) out, priced by the prior.
Two lines from the definition of conditional probability — the math is never the controversial part.
Posterior ∝ likelihood × prior; the evidence is bookkeeping that cancels in any argmax.
Rare condition + decent test = mostly false alarms. Always multiply by the base rate.
Inference is updating: posterior today, prior tomorrow.