General ML

Precision / Recall / F1 / AUC-ROC

What to measure when accuracy starts lying

01 · The failure · The 99%-accurate test that never says cancer

A disease affects 1% of patients. Consider a classifier that always returns "healthy": no model, no features, just a constant. Its accuracy is 99%. It also detects zero cases, which was the entire job. Accuracy averages over a population in which the interesting class is a rounding error, so a model can score superbly by ignoring the problem.

The lesson generalises: any single number that pools both classes inherits the imbalance. Before trusting a metric, ask what the dumbest baseline scores on it.
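The baseline check can be run in a few lines. This is a minimal sketch on a made-up population of 10,000 patients at the 1% prevalence from the text; the population size and the variable names are assumptions for illustration.

```python
# Hypothetical population: 10,000 patients, 1% of whom have the disease (label 1).
labels = [1] * 100 + [0] * 9_900

# The constant "classifier" from the text: predict healthy (0) for everyone.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
detected = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(f"accuracy       = {accuracy:.2%}")
print(f"cases detected = {detected} of {sum(labels)}")
```

Any candidate metric can be put through the same test: score the constant predictor first, and treat that number as the floor a real model has to beat.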

02 · Ground truth · The confusion matrix comes first

Every metric in this note is arithmetic on four counts. Get the counts straight and the metrics stop being vocabulary:

	Predicted positive	Predicted negative
Actually positive	TP — caught it	FN — missed it (the silent failure)
Actually negative	FP — false alarm	TN — correct rejection

The two error types almost never cost the same. A missed cancer (FN) and an unnecessary biopsy (FP) are different tragedies; a blocked legitimate email and a delivered scam are different annoyances. Metric choice is really a statement about which cell you fear.
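Counting the four cells takes one pass over the data. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the toy data are illustrative, not from any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, where 1 = positive and 0 = negative."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # caught it
        elif truth == 0 and pred == 1:
            fp += 1   # false alarm
        elif truth == 1 and pred == 0:
            fn += 1   # missed it
        else:
            tn += 1   # correct rejection
    return tp, fp, fn, tn

# Toy data: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```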

03 · The pair · Precision and recall, and the threshold between them

Precision = TP / (TP + FP)

Trust in positive calls. When the model raises its hand, how often is it right? The metric of false-alarm cost: spam filters, automated actions, anything where crying wolf is expensive.

Recall = TP / (TP + FN)

Coverage of real positives. Of everything that was truly there, how much did we find? The metric of miss cost: cancer screening, fraud detection, safety filters.

A scoring classifier becomes a decision rule only when you pick a threshold, and the threshold is a lever with these two on opposite ends. Lower it and you call more things positive: recall rises, precision falls. Raise it and you speak only when sure: precision rises, recall falls. Neither number means much without the other — recall 1.0 is trivially available by flagging everyone (the inverse of section 01's scam).
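The lever is easy to see on a toy example. The sketch below uses ten invented scores and labels and evaluates three thresholds; the helper name `precision_recall` is an assumption, and precision is reported as `None` when the model makes no positive calls, since the ratio is undefined there.

```python
# Invented scores and true labels (1 = positive), sorted by score for readability.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called positive.

    Assumes the data contains at least one actual positive.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None  # undefined with no positive calls
    recall = tp / (tp + fn)
    return precision, recall

# Lowering the threshold raises recall and lowers precision on this data.
for threshold in (0.85, 0.50, 0.00):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

At threshold 0.00 every example is flagged, so recall reaches 1.0 while precision collapses to the positive rate of the data.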

The F1 score compresses the pair via the harmonic mean:

F1 = 2·P·R / (P + R)

Harmonic, not arithmetic, because the harmonic mean is dragged toward the smaller value: P = 1.0 with R = 0.02 gives F1 ≈ 0.04, not the flattering 0.51 an average would report. F1 punishes lopsidedness: you cannot buy it with one good number. (It still hides which side is weak, weights precision and recall equally, and ignores TN entirely; when the two error types should not count equally, use F_β instead.)
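The arithmetic can be checked directly. The sketch below uses the standard F_β definition, F_β = (1 + β²)·P·R / (β²·P + R), which reduces to F1 at β = 1; the function name `f_beta` is an illustrative choice.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0  # common convention when both are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 1.0, 0.02
arithmetic_mean = (p + r) / 2
f1 = f_beta(p, r)            # harmonic mean of p and r
f2 = f_beta(p, r, beta=2.0)  # recall-weighted: punishes the tiny recall even harder

print(f"arithmetic mean = {arithmetic_mean:.2f}")
print(f"F1              = {f1:.3f}")
print(f"F2              = {f2:.3f}")
```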

04 · All thresholds at once · ROC and AUC

Rather than defend one threshold, sweep them all. For each threshold plot the true-positive rate (recall) against the false-positive rate FP/(FP+TN); the sweep traces the ROC curve.

Each point on a curve is one threshold. The diagonal is random guessing; better models bow toward the top-left corner.
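The sweep itself is a short loop. This is a minimal sketch on invented scores, using each distinct score as a threshold; the name `roc_points` is hypothetical, and the data is assumed to contain at least one positive and one negative.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold, from strictest to loosest."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Invented scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always starts at (0, 0), where nothing is flagged, and ends at (1, 1), where everything is.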

The area under the curve has a clean probabilistic meaning, and it is the right way to remember AUC:

AUC = P( score(random positive) > score(random negative) )

It is a pure ranking metric: 0.5 means the scores carry no order information, 1.0 means every positive outranks every negative. It is threshold-free and insensitive to calibration — which is both its virtue and its blind spot.
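The probabilistic definition translates directly into code: compare every positive with every negative and count the wins. The sketch below is a brute-force O(n²) illustration on invented data, with ties counted as half a win (a common convention, assumed here); `pairwise_auc` is a hypothetical helper name.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie convention (assumption)
    return wins / (len(pos) * len(neg))

# Invented scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

auc = pairwise_auc(scores, labels)

# A strictly increasing transform changes every score but not their order.
squashed = [s ** 3 for s in scores]
auc_squashed = pairwise_auc(squashed, labels)

print(f"AUC on raw scores      = {auc}")
print(f"AUC on squashed scores = {auc_squashed}")
```

The second call shows the blind spot: cubing the scores wrecks any probabilistic meaning they had, yet the AUC does not move, because only the ordering is measured.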

05 · The fine print · When PR curves beat ROC

ROC's x-axis is FPR = FP/(FP+TN), and under heavy imbalance TN is astronomical. A fraud model that fires 10,000 false alarms against 10 million legitimate transactions has FPR = 0.1% — invisible on the ROC plot — while its precision may be a catastrophic 5%. The ROC curve looks immaculate because the negatives absorb any number of false positives into the denominator.
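The fraud numbers work out as follows. The false-alarm and legitimate-transaction counts come from the text; the number of frauds caught (500) is an assumption chosen so that precision lands near the 5% figure.

```python
legit_transactions = 10_000_000  # all actual negatives (FP + TN), from the text
false_alarms = 10_000            # FP, from the text
frauds_caught = 500              # TP: an assumed value, not given in the text

fpr = false_alarms / legit_transactions                   # FP / (FP + TN)
precision = frauds_caught / (frauds_caught + false_alarms)  # TP / (TP + FP)

print(f"FPR       = {fpr:.2%}")
print(f"precision = {precision:.2%}")
```

The same 10,000 false alarms are divided by ten million in one ratio and by roughly ten thousand in the other, which is why the two metrics tell such different stories.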

The precision–recall curve replaces FPR with precision, which has FP in a small denominator (TP+FP, the calls you actually made) and therefore feels every false alarm. Rule of thumb:

Situation	Prefer	Why
Roughly balanced classes; both error types matter	ROC / AUC	Stable, threshold-free, comparable across datasets.
Heavy imbalance; the positive class is the point	PR curve / AUPRC	Precision exposes false-alarm cost that FPR hides; the chance baseline (= positive rate) keeps you honest.
One deployed operating point	P, R at that threshold (+ a CI)	Users experience a threshold, not a curve.

The deeper pattern is the one from Bayes: precision is a posterior, P(truly positive | flagged), so it depends on the base rate; TPR and FPR are likelihoods, conditioned on the true class, so they do not. Choose the metric that conditions the way your deployment does.
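Bayes' rule makes the dependence explicit: precision = TPR·π / (TPR·π + FPR·(1 − π)), where π is the base rate. The sketch below holds a hypothetical model's TPR and FPR fixed (0.90 and 0.05, values invented for illustration) and varies only the prevalence.

```python
def precision_from_rates(tpr, fpr, base_rate):
    """Posterior P(truly positive | flagged) via Bayes' rule."""
    flagged_positives = tpr * base_rate        # P(flagged and positive)
    flagged_negatives = fpr * (1 - base_rate)  # P(flagged and negative)
    return flagged_positives / (flagged_positives + flagged_negatives)

# Same model (fixed TPR and FPR), different populations.
tpr, fpr = 0.90, 0.05
for base_rate in (0.50, 0.10, 0.01):
    precision = precision_from_rates(tpr, fpr, base_rate)
    print(f"base rate={base_rate:.2f}  precision={precision:.3f}")
```

The ROC curve for this model is identical in all three populations, while its precision falls from about 0.95 to about 0.15 as the positive class becomes rare.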

Mental Model

Accuracy is a population average; imbalance lets a useless model ace it.
Four counts first — every metric is arithmetic on the confusion matrix.
Precision = trust in the model's positive calls; recall = coverage of the truth; the threshold trades one for the other.
F1 = harmonic mean: dragged toward the weaker of the pair, so it cannot be gamed one-sided.
AUC = probability a random positive outranks a random negative; under heavy imbalance, read the PR curve instead.