What to measure when accuracy starts lying
A disease affects 1% of patients. Consider a classifier that always returns "healthy": no model, no features, just a constant. Its accuracy is 99%. It also detects zero cases, which was the entire job. Accuracy averages over a population in which the interesting class is a rounding error, so a model can score superbly by ignoring the problem.
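The arithmetic is easy to check by hand, but a minimal sketch makes it concrete. The population of 1,000 patients and the function name `always_healthy` are illustrative assumptions; only the 1% prevalence comes from the text.

```python
# Hypothetical population: 1,000 patients at 1% prevalence (10 sick).
labels = [1] * 10 + [0] * 990  # 1 = sick, 0 = healthy


def always_healthy():
    """The constant classifier: ignores the patient entirely."""
    return 0


predictions = [always_healthy() for _ in labels]

# Accuracy counts every agreement, and almost every patient is healthy.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Cases caught counts only the sick patients that were flagged.
cases_caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(f"accuracy = {accuracy:.0%}, cases caught = {cases_caught}")
```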
Every metric in this note is arithmetic on four counts. Get the counts straight and the metrics stop being vocabulary:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | TP: caught it | FN: missed it (the silent failure) |
| Actually negative | FP: false alarm | TN: correct rejection |
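As a rough sketch of how the four cells are tallied, the helper below walks paired labels and predictions once. The name `confusion_counts` and the eight-example toy data are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 means positive."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # caught it
        elif actual == 1 and predicted == 0:
            fn += 1  # missed it
        elif actual == 0 and predicted == 1:
            fp += 1  # false alarm
        else:
            tn += 1  # correct rejection
    return tp, fp, fn, tn


# Toy data (made up): three real positives, five real negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy:.2f}")
```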
The two error types almost never cost the same. A missed cancer (FN) and an unnecessary biopsy (FP) are different tragedies; a blocked legitimate email and a delivered scam are different annoyances. Metric choice is really a statement about which cell you fear.
A scoring classifier becomes a decision rule only when you pick a threshold, and the threshold is a lever with precision, TP/(TP+FP), and recall, TP/(TP+FN), on opposite ends. Lower it and you call more things positive: recall rises, precision falls. Raise it and you speak only when sure: precision rises, recall falls. Neither number means much without the other: recall 1.0 is trivially available by flagging everyone, the mirror image of the always-"healthy" classifier above.
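The lever can be seen on a small example. This sketch assumes made-up scores and labels, a hypothetical helper `precision_recall_at`, and the convention that precision is 1.0 when nothing is flagged.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called positive."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    # Convention (an assumption): precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy scores (made up): four positives and four negatives.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.9, 0.5, 0.0):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this data the strict threshold gives perfect precision with low recall, and the threshold of 0.0 flags everyone, which buys recall 1.0 at a precision equal to the positive rate.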
The F1 score compresses the pair via the harmonic mean of precision P and recall R:

$$F_1 = \frac{2PR}{P + R}$$
Harmonic, not arithmetic, because the harmonic mean is dragged toward the smaller value: P = 1.0 with R = 0.02 gives F1 ≈ 0.04, not the flattering 0.51 an average would report. F1 punishes lopsidedness: you cannot buy it with one good number. (It still hides which side is weak, ignores TN entirely, and weights precision and recall equally; Fβ is the variant for unequal weights.)
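The numbers in the example can be reproduced directly. The sketch below uses the standard Fβ definition, (1 + β²)·P·R / (β²·P + R), which reduces to F1 at β = 1; the function name `f_beta` is an illustrative choice.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid 0/0 when both inputs are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


p, r = 1.0, 0.02
f1 = f_beta(p, r)              # harmonic mean
arithmetic_mean = (p + r) / 2  # the flattering alternative

print(f"F1 = {f1:.2f}, arithmetic mean = {arithmetic_mean:.2f}")
print(f"F2 = {f_beta(p, r, beta=2):.3f}, F0.5 = {f_beta(p, r, beta=0.5):.3f}")
```

With recall this weak, F2 (recall-weighted) scores the model lower than F1, while F0.5 (precision-weighted) scores it higher.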
Rather than defend one threshold, sweep them all. For each threshold plot the true-positive rate (recall) against the false-positive rate FP/(FP+TN); the sweep traces the ROC curve.
Each point on a curve is one threshold. The diagonal is random guessing; better models bow toward the top-left corner.
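A minimal sketch of the sweep, assuming distinct scores (ties would need to be grouped into a single step) and the same kind of made-up toy data; `roc_points` and `trapezoid_area` are hypothetical helper names. Sorting by descending score and lowering the threshold past one example at a time yields one (FPR, TPR) point per threshold.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points from sweeping the threshold from high to low.

    Assumes no tied scores and at least one example of each class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1  # step up: one more positive caught
        else:
            fp += 1  # step right: one more false alarm
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy scores (made up): four positives and four negatives.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = trapezoid_area(points)
print(points)
print(f"area under the curve = {area}")
```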
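The area under the curve has a clean probabilistic meaning, and it is the right way to remember AUC: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative,

$$\mathrm{AUC} = P\big(s(x^{+}) > s(x^{-})\big),$$

where s is the model's score, x⁺ is a random positive example, and x⁻ is a random negative one.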
It is a pure ranking metric: 0.5 means the scores carry no order information, 1.0 means every positive outranks every negative. It is threshold-free and insensitive to calibration — which is both its virtue and its blind spot.
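The ranking interpretation can be computed literally by comparing every positive score with every negative score. This is a sketch on made-up data: `pairwise_auc` is a hypothetical helper, tied pairs are counted as half a win (the usual convention), and the quadratic pair count makes it suitable only for small examples.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Toy scores (made up): four positives and four negatives.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = pairwise_auc(scores, labels)

# Cubing is monotonic on [0, 1]: it wrecks any calibration but keeps the order.
distorted = [s ** 3 for s in scores]
auc_distorted = pairwise_auc(distorted, labels)

print(f"AUC = {auc}, AUC after distorting the scores = {auc_distorted}")
```

The distorted scores are no longer usable as probabilities, yet the AUC is unchanged, which is the blind spot in practice.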
ROC's x-axis is FPR = FP/(FP+TN), and under heavy imbalance TN is astronomical. A fraud model that fires 10,000 false alarms against 10 million legitimate transactions has FPR = 0.1% — invisible on the ROC plot — while its precision may be a catastrophic 5%. The ROC curve looks immaculate because the negatives absorb any number of false positives into the denominator.
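Plugging in the numbers shows the gap. The false-alarm and legitimate-transaction counts come from the example above; the true-positive count of 526 is an assumption, chosen only because it yields a precision of roughly 5%.

```python
legit_transactions = 10_000_000  # all actual negatives (FP + TN)
false_alarms = 10_000            # FP
frauds_caught = 526              # TP: hypothetical, not stated in the text

true_negatives = legit_transactions - false_alarms

# FPR divides the false alarms by the enormous pool of negatives.
fpr = false_alarms / (false_alarms + true_negatives)

# Precision divides by the alerts actually raised, a far smaller pool.
precision = frauds_caught / (frauds_caught + false_alarms)

print(f"FPR = {fpr:.2%}, precision = {precision:.1%}")
```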
The precision–recall curve replaces FPR with precision, which has FP in a small denominator (TP+FP, the calls you actually made) and therefore feels every false alarm. Rule of thumb:
| Situation | Prefer | Why |
|---|---|---|
| Roughly balanced classes; both error types matter | ROC / AUC | Stable, threshold-free, comparable across datasets. |
| Heavy imbalance; the positive class is the point | PR curve / AUPRC | Precision exposes false-alarm cost that FPR hides; the chance baseline (= positive rate) keeps you honest. |
| One deployed operating point | P, R at that threshold (+ a CI) | Users experience a threshold, not a curve. |
The deeper pattern is the one from Bayes: precision is a posterior, P(truly positive | flagged), so it depends on the base rate; TPR and FPR are likelihoods, conditioned on the true class, so they do not. Choose the metric that conditions the way your deployment does.
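Bayes' rule makes the dependence explicit: precision = TPR·π / (TPR·π + FPR·(1 − π)), where π is the base rate. The sketch below holds a hypothetical model fixed at TPR = 0.90 and FPR = 0.05 (both numbers are assumptions) and varies only the prevalence.

```python
def precision_from_rates(tpr, fpr, base_rate):
    """Posterior P(truly positive | flagged) via Bayes' rule."""
    flagged_and_positive = tpr * base_rate
    flagged_and_negative = fpr * (1 - base_rate)
    return flagged_and_positive / (flagged_and_positive + flagged_and_negative)


# Hypothetical model: the same likelihoods at every prevalence.
tpr, fpr = 0.90, 0.05

for base_rate in (0.5, 0.1, 0.01, 0.001):
    precision = precision_from_rates(tpr, fpr, base_rate)
    print(f"base rate {base_rate:>6.1%} -> precision {precision:.1%}")
```

The ROC point (0.05, 0.90) is identical in every row; only the posterior moves as the positive class gets rarer.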