General ML

Cross Validation

Estimating generalisation without wasting the data

01 · First principles · The problem with one split

To compare models or tune hyperparameters we need a number: expected error on unseen data. The naive estimator is a single train/test split. It has two defects, and they pull in opposite directions. Hold out 20% and you have wasted a fifth of your data — the model you evaluate is trained on less than the model you will ship. Hold out less and the estimate gets noisy: a small test set is one roll of the dice, and which particular points landed in it can swing the measured error by whole percentage points. With one split you are reading a single noisy sample of a random variable and calling it the mean. Decisions made on such numbers — "model A beats model B by 0.4%" — are frequently decisions about noise.
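The noise in a single split is easy to see in simulation. The sketch below is an illustration under assumptions that are not in the original: synthetic one-dimensional two-class data, a nearest-centroid threshold classifier, and the hypothetical helper names `make_data`, `fit_threshold`, and `single_split_error`. It holds out 20% of the same 200 points 100 times, with a different shuffle each time, and reports how far the measured error moves.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic 1-D data: class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data

def fit_threshold(train):
    """Nearest-centroid rule in 1-D: the midpoint between the two class means."""
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0

def error_rate(threshold, test):
    """Fraction misclassified; predicts class 1 when x exceeds the threshold
    (this assumes class 1 has the larger mean, as it does in make_data)."""
    wrong = sum((x > threshold) != (y == 1) for x, y in test)
    return wrong / len(test)

def single_split_error(data, test_fraction, rng):
    """One random train/test split, returning the test error."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    return error_rate(fit_threshold(train), test)

rng = random.Random(0)
data = make_data(200, rng)

# Same dataset, same model family, 100 different 80/20 splits.
errors = [single_split_error(data, 0.2, rng) for _ in range(100)]
print(f"min={min(errors):.3f}  max={max(errors):.3f}  "
      f"stdev={statistics.pstdev(errors):.3f}")
```

Nothing about the data or the model changes between runs; only the choice of which 40 points are held out. The gap between `min` and `max` is the range a single split could have reported.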

02 · The fix · k-fold: every point validates exactly once

Cut the data into k folds of equal size (or as near equal as n allows). Train k times, each time holding out a different fold for validation and training on the other k−1. Average the k scores.

$$\mathrm{CV}_k = \frac{1}{k} \sum_{j=1}^{k} \mathrm{error}\big(\text{model trained on } D \setminus \mathrm{fold}_j,\ \text{evaluated on } \mathrm{fold}_j\big)$$

Both defects are addressed at once. Every point is used for training (in k−1 of the runs) and for validation (in exactly one), so no data is permanently set aside; each model still trains on only (k−1)/k of the data, so a small pessimistic bias remains. The average of k scores has lower variance than any single fold's score. The spread of the k scores is a bonus: a free, if rough, error bar on the estimate, which one split never gives you. It is rough because the k models share most of their training data, so the fold scores are correlated and their spread tends to understate the true uncertainty. The price is k× the training compute, which is why k = 5 or 10 is standard for classical models and why deep learning, where one training run is expensive, mostly falls back to a single (large) validation split and accepts the noise.

Figure (k = 5): the validation fold marches across the runs; every point validates exactly once and trains four times.
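The procedure can be sketched from scratch in a few lines. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `cross_validate`, `fit_line`, `mse`) and the toy regression data are assumptions made for the example. In practice a library splitter would normally replace the hand-rolled one.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal it into k folds whose sizes differ by at most 1."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[j::k] for j in range(k)]

def cross_validate(data, k, fit, error, seed=0):
    """Return the k per-fold errors; `fit` and `error` are supplied by the caller."""
    folds = k_fold_indices(len(data), k, seed)
    scores = []
    for j in range(k):
        val = [data[i] for i in folds[j]]
        # Train on every fold except fold j.
        train = [data[i] for m in range(k) if m != j for i in folds[m]]
        scores.append(error(fit(train), val))
    return scores

def fit_line(train):
    """Closed-form least squares for y = slope * x + intercept."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mse(model, val):
    """Mean squared error of a (slope, intercept) model on the validation pairs."""
    slope, intercept = model
    return statistics.fmean((y - (slope * x + intercept)) ** 2 for x, y in val)

# Toy data (an assumption of this sketch): a noisy straight line.
rng = random.Random(1)
data = [(float(x), 2.0 * x + rng.gauss(0.0, 1.0)) for x in range(50)]

scores = cross_validate(data, k=5, fit=fit_line, error=mse)
cv_estimate = statistics.fmean(scores)   # the CV_k average
spread = statistics.stdev(scores)        # fold-to-fold spread
print(f"CV_5 = {cv_estimate:.3f}, fold-to-fold stdev = {spread:.3f}")
```

Passing `fit` and `error` as functions keeps the splitting logic independent of the model, so the same loop serves any estimator that can be refit from a list of training rows.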

03 · Variants · Stratified, LOO, nested

Stratified k-fold: force each fold to preserve the class proportions of the whole dataset. With imbalanced classes a random fold can contain almost no positives, making its score meaningless; stratification is the default for classification, costing nothing.
Leave-one-out (k = n): the extreme — train n times on n−1 points. Nearly unbiased (each model sees almost all the data), but n trainings is usually absurd, and the n models are so similar that the estimate's variance is not as low as the effort suggests. Worth it only for very small datasets or models with closed-form refits.
Nested CV: when CV both tunes hyperparameters and reports final performance, the report is contaminated — the winning configuration was chosen because it scored well on these folds. Nest an inner CV loop for tuning inside an outer loop for assessment; the outer score is then honest.
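Two of these variants combine naturally in one sketch: stratified folds as the splitter, and a nested loop in which tuning only ever sees the outer training portion. Everything named here is an assumption for illustration: the helpers `stratified_folds`, `knn_error`, `cv_error`, and `nested_cv`, the one-dimensional k-nearest-neighbours classifier, the candidate neighbour counts, and the synthetic 20%-positive data. Leave-one-out needs no separate code; it is the same loop with k set to n (and without stratification, since a one-point fold cannot preserve class ratios).

```python
import random
import statistics
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Deal each class's shuffled indices round-robin so folds keep class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    offset = 0  # carry the dealing position across classes to balance fold sizes
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[(offset + pos) % k].append(i)
        offset += len(members)
    return folds

def knn_error(train, val, n_neighbors):
    """Error rate of a 1-D k-nearest-neighbours majority vote."""
    wrong = 0
    for x, y in val:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
        votes = Counter(label for _, label in nearest)
        wrong += votes.most_common(1)[0][0] != y
    return wrong / len(val)

def cv_error(data, k, n_neighbors, seed):
    """Plain stratified k-fold error for one hyperparameter value."""
    folds = stratified_folds([y for _, y in data], k, seed)
    errs = []
    for j in range(k):
        val = [data[i] for i in folds[j]]
        train = [data[i] for m in range(k) if m != j for i in folds[m]]
        errs.append(knn_error(train, val, n_neighbors))
    return statistics.fmean(errs)

def nested_cv(data, outer_k, inner_k, candidates, seed=0):
    """Outer loop assesses; inner loop tunes using only the outer training part."""
    outer = stratified_folds([y for _, y in data], outer_k, seed)
    outer_scores = []
    for j in range(outer_k):
        held_out = [data[i] for i in outer[j]]
        rest = [data[i] for m in range(outer_k) if m != j for i in outer[m]]
        # The held-out fold plays no part in choosing the hyperparameter.
        best = min(candidates, key=lambda c: cv_error(rest, inner_k, c, seed + 1))
        outer_scores.append(knn_error(rest, held_out, best))
    return outer_scores

# Imbalanced toy data: 24 positives, 96 negatives.
rng = random.Random(0)
labels = [1] * 24 + [0] * 96
data = [(rng.gauss(2.0 * y, 1.0), y) for y in labels]

outer_scores = nested_cv(data, outer_k=4, inner_k=3, candidates=[1, 3, 5, 7])
print("outer-fold errors:", [round(s, 3) for s in outer_scores])
print("nested CV estimate:", round(statistics.fmean(outer_scores), 3))
```

The odd candidate values avoid tied votes in a two-class problem. Different outer folds may select different hyperparameters; the nested estimate assesses the whole tune-then-fit procedure rather than one fixed configuration.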

04 · The cardinal sins · How CV lies to you

Sin	What happens	The rule
Preprocessing leakage	Standardisation, feature selection, whitening, or imputation fit on all data before splitting lets the validation fold's statistics leak into training. Selecting features on the full data then "validating" can make pure noise score brilliantly.	Fit every preprocessing step inside the fold, on the training portion only (pipelines exist for exactly this).
Tuning on the test set	Evaluate on the held-out test set repeatedly while tuning, and you have promoted it to a validation set; the final number is an optimistic fiction.	The test set is touched once, after every decision is frozen. CV is where the iteration happens.
Shuffling temporal data	Random folds let the model train on the future and validate on the past: backtesting a forecaster with tomorrow's newspaper. Scores collapse on deployment.	Time-series splits: expanding or rolling windows, with training data always strictly before validation data. Repeated entities (same patient, same user) need group-aware folds for a parallel reason: records from one entity must never sit on both sides of a split.
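The two splitters named in the last row can be sketched as follows. Both functions, `expanding_window_splits` and `group_folds`, are hypothetical helpers written for this illustration, and the greedy balancing used for groups is one simple choice among several.

```python
def expanding_window_splits(n, n_splits, min_train):
    """Yield (train_idx, val_idx) pairs for time-ordered rows 0..n-1.

    The training window grows with each split, and every training index
    comes strictly before every validation index.
    """
    val_size = (n - min_train) // n_splits
    if val_size < 1:
        raise ValueError("not enough points for the requested splits")
    for s in range(n_splits):
        train_end = min_train + s * val_size
        yield list(range(train_end)), list(range(train_end, train_end + val_size))

def group_folds(groups, k):
    """Assign whole groups to folds so no entity straddles train and validation."""
    sizes = {}
    for g in groups:
        sizes[g] = sizes.get(g, 0) + 1
    fold_of = {}
    load = [0] * k
    # Greedy balancing: largest groups first, each into the currently lightest fold.
    for g in sorted(sizes, key=lambda g: (-sizes[g], str(g))):
        j = load.index(min(load))
        fold_of[g] = j
        load[j] += sizes[g]
    folds = [[] for _ in range(k)]
    for i, g in enumerate(groups):
        folds[fold_of[g]].append(i)
    return folds

# Time series: 12 ordered points, 3 splits, at least 6 points to train on.
splits = list(expanding_window_splits(n=12, n_splits=3, min_train=6))
for train, val in splits:
    print(f"train 0..{train[-1]}  validate {val}")

# Repeated entities: one label per row saying which patient or user it belongs to.
groups = ["a", "a", "b", "c", "c", "c", "d"]
folds = group_folds(groups, k=2)
print(folds)
```

A rolling window differs only in dropping the oldest training indices so the window length stays fixed; the ordering guarantee is the same.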

One sentence covers all three: the validation fold must contain no information that was available, in any form, when its model was built. Every sin is a violation of that sentence; the gap-based diagnostics of overfitting / underfitting are only as honest as this hygiene.
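The claim that leaky feature selection can make pure noise score well is straightforward to reproduce. The sketch below rests on assumptions chosen for the demonstration: 50 rows of 2000 independent Gaussian noise features, labels that alternate regardless of the features, a nearest-centroid classifier, and the hypothetical helpers `top_features`, `centroid_accuracy`, and `cv_accuracy`. The only difference between the two runs is whether the feature ranking saw the validation rows.

```python
import random
import statistics

def top_features(rows, labels, idx, n_keep):
    """Rank features by |mean(class 1) - mean(class 0)| using only rows in idx."""
    pos = [i for i in idx if labels[i] == 1]
    neg = [i for i in idx if labels[i] == 0]

    def gap(f):
        return abs(statistics.fmean(rows[i][f] for i in pos)
                   - statistics.fmean(rows[i][f] for i in neg))

    return sorted(range(len(rows[0])), key=gap, reverse=True)[:n_keep]

def centroid_accuracy(rows, labels, train, val, feats):
    """Nearest-centroid accuracy on val, with centroids fit on train only."""
    cents = {}
    for c in (0, 1):
        members = [i for i in train if labels[i] == c]
        cents[c] = [statistics.fmean(rows[i][f] for i in members) for f in feats]
    hits = 0
    for i in val:
        dist = {c: sum((rows[i][f] - m) ** 2 for f, m in zip(feats, cents[c]))
                for c in (0, 1)}
        hits += min(dist, key=dist.get) == labels[i]
    return hits / len(val)

def cv_accuracy(rows, labels, k, n_keep, leaky):
    """k-fold accuracy; `leaky` decides which rows the feature selection sees."""
    n = len(rows)
    all_idx = list(range(n))
    accs = []
    for j in range(k):
        val = list(range(j, n, k))
        train = [i for i in all_idx if i not in val]
        # Leaky: selection uses every row, including this fold's validation rows.
        # Honest: selection is refit inside the fold, on the training rows only.
        feats = top_features(rows, labels, all_idx if leaky else train, n_keep)
        accs.append(centroid_accuracy(rows, labels, train, val, feats))
    return statistics.fmean(accs)

rng = random.Random(0)
n, p = 50, 2000
rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
labels = [i % 2 for i in range(n)]  # unrelated to the features by construction

leaky_acc = cv_accuracy(rows, labels, k=5, n_keep=20, leaky=True)
honest_acc = cv_accuracy(rows, labels, k=5, n_keep=20, leaky=False)
print(f"selection on all data:    {leaky_acc:.2f}")
print(f"selection inside folds:   {honest_acc:.2f}")
```

Since the labels are independent of every feature, any accuracy well above 0.5 is an artifact. The honest run should hover near chance, while the leaky run typically scores far higher, even though the classifier itself never trained on the validation rows.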

Mental Model

A single split is one noisy sample of the quantity you care about; k-fold averages k of them and wastes nothing.
Every point validates exactly once, trains k−1 times; the fold scores' spread is a free error bar.
Stratify for classification; leave-one-out only when data is tiny; nest the loops when CV both tunes and reports.
Fit all preprocessing inside the fold — leakage is the silent killer of honest numbers.
The test set is read once. Temporal data validates only on its future.