A procedure with a guarantee — not a probability about the truth
Your model scores 84.3% on the test set. Retrain with a different seed: 83.1%. Different test split: 85.0%. The number you reported was one draw from a distribution of possible numbers, and reporting it alone hides everything about how far it might wander. The honest object is the estimate plus its wobble — and the confidence interval is the standard packaging for that wobble.
A 95% confidence interval is the output of a procedure with this property: across repeated experiments, 95% of the intervals the procedure constructs contain the true value. The randomness is in the interval — the truth is a fixed number that does not move.
This forces the famous reading that everyone resists: once your particular interval [83.9, 85.1] is computed, it either contains the truth or it does not. The statement "there is a 95% probability the truth lies in [83.9, 85.1]" is, under the frequentist rules that built the interval, not even well formed. What you may say: "this interval came from a process that captures the truth 95% of the time."
Figure: each line is the interval from one repetition of the experiment. The truth never moves; the intervals do. About 1 in 20 misses.
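The repeated-experiment reading can be checked by simulation. The sketch below is illustrative only: the true accuracy of 0.843, the test-set size of 500, and the helper names `mean_ci` and `coverage` are assumptions made for the example, not values from any real experiment. It builds one interval per simulated experiment and counts how often the fixed truth lands inside.

```python
import math
import random


def mean_ci(samples, z=1.96):
    """Normal-approximation interval (low, high) for the mean of samples."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    se = math.sqrt(var / n)
    return mean - z * se, mean + z * se


def coverage(true_mean=0.843, n=500, repetitions=1000, seed=0):
    """Fraction of simulated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # One experiment: n predictions, each correct with probability true_mean.
        samples = [1.0 if rng.random() < true_mean else 0.0 for _ in range(n)]
        low, high = mean_ci(samples)
        hits += low <= true_mean <= high
    return hits / repetitions


print(f"empirical coverage: {coverage():.3f}")
```

The printed fraction should sit near 0.95, not exactly on it: the interval relies on a Gaussian approximation, and the simulation itself is a finite sample of repetitions.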
Where does the recipe come from? The central limit theorem: a mean of n independent samples is approximately Gaussian around the truth, with standard deviation σ/√n (variance bookkeeping from the variance note). That standard deviation of the estimate is its standard error, SE. A Gaussian keeps 95% of its mass within 1.96 standard deviations, so:
95% CI = estimate ± 1.96 · SE, with SE = σ/√n for a mean.
For a test-set accuracy p̂ on n examples, each prediction is a Bernoulli trial, so SE = √(p̂(1−p̂)/n). Concretely: 90% accuracy on n = 1,000 gives SE ≈ 0.95%, hence roughly ±1.9% — your "90.0%" is statistically indistinguishable from a competitor's 88.5%. The √n in the denominator is the quiet tyrant: halving the error bar costs four times the data.
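The arithmetic above can be reproduced in a few lines. This is a minimal sketch of the normal-approximation interval for a proportion (often called the Wald interval); the function name `accuracy_ci` is made up for the example, and the approximation is known to degrade when n is small or the accuracy is very close to 0 or 1.

```python
import math


def accuracy_ci(p_hat, n, z=1.96):
    """Return (se, low, high) for an accuracy p_hat measured on n examples."""
    # Each prediction is a Bernoulli trial, so the variance of the mean is p(1-p)/n.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    half_width = z * se
    return se, p_hat - half_width, p_hat + half_width


se, low, high = accuracy_ci(0.90, 1000)
print(f"n=1000: SE = {se:.4f}, 95% CI = [{low:.3f}, {high:.3f}]")

# Quadrupling the data halves the standard error (the sqrt(n) effect).
se_4x, low_4x, high_4x = accuracy_ci(0.90, 4000)
print(f"n=4000: SE = {se_4x:.4f}, 95% CI = [{low_4x:.3f}, {high_4x:.3f}]")
```

With n = 1,000 the interval spans roughly 88.1% to 91.9%, which is why a competitor's 88.5% falls inside it.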
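The z·SE recipe needs a formula for the standard error, and for most quantities you actually report (F1, AUC, BLEU, a median, a ratio of metrics) no clean formula exists. The bootstrap replaces the formula with simulation: treat the sample as a stand-in for the population, and replay the experiment by resampling. Concretely, draw n items with replacement from the n you have, recompute the metric on the resample, and repeat many times; the 2.5th and 97.5th percentiles of the recomputed values give a 95% interval.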
It is crude, embarrassingly parallel, assumption-light, and the default answer to "how do I get error bars on this weird metric". Its main failure modes are tiny samples and statistics dominated by extremes.
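A percentile bootstrap for a metric with no convenient standard-error formula might look like the sketch below. It uses F1 as the example metric and entirely synthetic labels and predictions; the names `f1_score` and `bootstrap_ci`, the 1,000 resamples, and the 15% label-flip rate are assumptions chosen for illustration. Other bootstrap variants exist (for example BCa), and this sketch does not implement them.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping label/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    # Approximate empirical quantiles of the resampled statistic.
    low = stats[round((alpha / 2) * (n_resamples - 1))]
    high = stats[round((1 - alpha / 2) * (n_resamples - 1))]
    return low, high


# Synthetic data: about 30% positives, predictions wrong about 15% of the time.
data_rng = random.Random(1)
y_true = [1 if data_rng.random() < 0.3 else 0 for _ in range(400)]
y_pred = [t if data_rng.random() < 0.85 else 1 - t for t in y_true]

point = f1_score(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, f1_score)
print(f"F1 = {point:.3f}, 95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```

Because each resample is independent of the others, the loop is the part that parallelizes trivially; swapping `f1_score` for any other function of `(y_true, y_pred)` needs no change to `bootstrap_ci`.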
| Claim | Without a CI | With a CI |
|---|---|---|
| "Our model beats the baseline by 0.8%" | Possibly seed noise | Checkable: do the intervals overlap? Overlap is only a rough, conservative check, so better to bootstrap the difference and see whether its interval excludes zero. |
| "New method: 76.2 on the benchmark" | One draw from the seed lottery | Mean ± CI over ≥3–5 seeds; seed variance is part of the result. |
| "Ablation X hurts performance" | Could reverse on a different split | Effect size with uncertainty, not a coin-flip ranking. |
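
For the first row of the table, bootstrapping the difference could be sketched as follows. The setup is assumed for illustration: both models are scored on the same test examples, `correct_a` and `correct_b` are hypothetical 0/1 per-example correctness lists, and the data here is synthetic. Resampling the same indices for both models keeps the comparison paired, so examples that both models find easy or hard do not inflate the noise.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=1000, alpha=0.05, seed=0):
    """Return (observed, low, high) for accuracy(A) - accuracy(B) on a shared test set."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Same resampled indices for both models: this is what makes it paired.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    low = diffs[round((alpha / 2) * (n_resamples - 1))]
    high = diffs[round((1 - alpha / 2) * (n_resamples - 1))]
    return observed, low, high


# Synthetic per-example correctness: most examples are easy for both models,
# and model A is slightly better on the hard remainder.
data_rng = random.Random(2)
correct_a, correct_b = [], []
for _ in range(1000):
    easy = data_rng.random() < 0.80
    correct_a.append(1 if easy or data_rng.random() < 0.30 else 0)
    correct_b.append(1 if easy or data_rng.random() < 0.25 else 0)

observed, low, high = paired_bootstrap_diff(correct_a, correct_b)
print(f"accuracy difference = {observed:+.3f}, 95% CI = [{low:+.3f}, {high:+.3f}]")
```

If the resulting interval contains zero, the data does not separate the two models at this test-set size; if it excludes zero, the claimed gap is at least not explained by test-set sampling noise alone. Seed-to-seed variation is a separate source of noise that this sketch does not cover.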