Cross Validation

Estimating generalisation without wasting the data

01 · First principles
The problem with one split

To compare models or tune hyperparameters we need a number: expected error on unseen data. The naive estimator is a single train/test split. It has two defects, and they pull in opposite directions. Hold out 20% and you have wasted a fifth of your data — the model you evaluate is trained on less than the model you will ship. Hold out less and the estimate gets noisy: a small test set is one roll of the dice, and which particular points landed in it can swing the measured error by whole percentage points. With one split you are reading a single noisy sample of a random variable and calling it the mean. Decisions made on such numbers — "model A beats model B by 0.4%" — are frequently decisions about noise.
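
The noise in a single split is easy to see on a toy problem. The sketch below is illustrative only: the synthetic two-class data, the one-dimensional nearest-centroid classifier, and all function names are assumptions made for the example, not anything prescribed above. It holds the dataset fixed and repeats the 80/20 split with 50 different shuffles, so any variation in the measured error comes purely from which points landed in the test set.

```python
import random
import statistics

def make_data(n, rng):
    """Two overlapping 1-D Gaussian classes (synthetic, for illustration only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, 1.5)
        data.append((x, label))
    return data

def fit_threshold(train):
    """Nearest-centroid classifier in 1-D: threshold at the midpoint of the class means."""
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0, mean1 > mean0

def error_rate(model, test):
    """Fraction of test points the threshold rule misclassifies."""
    threshold, one_is_above = model
    wrong = 0
    for x, y in test:
        pred = int((x > threshold) == one_is_above)
        wrong += pred != y
    return wrong / len(test)

def single_split_error(data, test_fraction, seed):
    """Shuffle with the given seed, hold out test_fraction, train on the rest."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    return error_rate(fit_threshold(train), test)

# Same 200 points every time; only the split changes.
data = make_data(200, random.Random(0))
errors = [single_split_error(data, 0.2, seed) for seed in range(50)]
print(f"min={min(errors):.3f} max={max(errors):.3f} "
      f"mean={statistics.fmean(errors):.3f} stdev={statistics.stdev(errors):.3f}")
```

With only 40 test points per split, each misclassified point moves the measured error by 2.5 percentage points, so the gap between the luckiest and unluckiest split is usually much larger than the model differences one hopes to detect.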

02 · The fix
k-fold: every point validates exactly once

Cut the data into k folds of equal (or near-equal) size. Train k times, each time holding out a different fold for validation and training on the other k−1. Average the k scores.

$$
\mathrm{CV}_k = \frac{1}{k} \sum_{j=1}^{k} \operatorname{error}\bigl(\text{model trained on } D \setminus \mathrm{fold}_j,\ \text{evaluated on } \mathrm{fold}_j\bigr)
$$

Both defects are addressed at once. Every point is used for training (in k−1 of the runs) and for validation (in exactly one), so nothing is wasted, and the average of k scores has lower variance than any single score. The reduction is smaller than a factor of k, because the k training sets overlap heavily and the fold scores are therefore correlated. The spread of the k scores is a bonus: a free, if rough, error bar on the estimate, which one split never gives you. The price is k× the training compute, which is why k = 5 or 10 is standard for classical models and why deep learning, where one training run is expensive, mostly falls back to a single (large) validation split and accepts the noise.
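
The procedure is short enough to write out by hand. The following is a minimal sketch, not a reference implementation: `k_fold_indices`, `cross_validate`, and the toy "predict the training mean" model are hypothetical names and choices made for illustration. It returns the k per-fold scores, so both the average and the fold-to-fold spread are available.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal them into k folds whose sizes differ by at most one."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[j::k] for j in range(k)]

def cross_validate(data, k, fit, error, seed=0):
    """Return the k per-fold validation errors.

    fit(train) -> model, error(model, val) -> float are supplied by the caller.
    """
    folds = k_fold_indices(len(data), k, seed)
    scores = []
    for j, val_idx in enumerate(folds):
        # Train on every fold except fold j, validate on fold j.
        train = [data[i] for m, fold in enumerate(folds) if m != j for i in fold]
        val = [data[i] for i in val_idx]
        scores.append(error(fit(train), val))
    return scores

# Toy usage: the "model" is the training mean, scored by mean squared error.
rng = random.Random(1)
data = [rng.gauss(5.0, 2.0) for _ in range(103)]
fit_mean = statistics.fmean
mse = lambda model, val: statistics.fmean((y - model) ** 2 for y in val)

scores = cross_validate(data, k=5, fit=fit_mean, error=mse)
cv_estimate = statistics.fmean(scores)   # the CV_k average
spread = statistics.stdev(scores)        # rough error bar from the folds
print(f"CV_5 = {cv_estimate:.3f}, fold-to-fold stdev = {spread:.3f}")
```

Dealing shuffled indices round-robin (`indices[j::k]`) handles dataset sizes that are not a multiple of k; here 103 points give folds of 21, 21, 21, 20 and 20.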

[Figure: 5-fold cross-validation. Each row is one training run (run 1 to run 5); in each run one fold is marked "validate" and the other four "train". The 5 scores are averaged.]

The validation fold marches across the runs; every point validates exactly once and trains four times.

03 · Variants
Stratified, LOO, nested

04 · The cardinal sins
How CV lies to you

| Sin | What happens | The rule |
| --- | --- | --- |
| Preprocessing leakage | Standardisation, feature selection, whitening, or imputation fit on all data before splitting lets the validation fold's statistics leak into training. Selecting features on the full data then "validating" can make pure noise score brilliantly. | Fit every preprocessing step inside the fold, on the training portion only (pipelines exist for exactly this). |
| Tuning on the test set | Evaluate on the held-out test set repeatedly while tuning, and you have promoted it to a validation set; the final number is an optimistic fiction. | The test set is touched once, after every decision is frozen. CV is where the iteration happens. |
| Shuffling temporal data | Random folds let the model train on the future and validate on the past: backtesting a forecaster with tomorrow's newspaper. Scores collapse on deployment. | Time-series splits: expanding or rolling windows, train always strictly before validate; group-aware folds for repeated entities (same patient, same user) for the same reason. |
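
The preprocessing-leakage claim can be checked on pure noise. The sketch below is an illustrative experiment under stated assumptions: the data are random Gaussian features with random balanced labels (so no feature carries signal), features are ranked by the difference in class means, and the classifier is a nearest-centroid rule. All names are hypothetical. The only difference between the two runs is whether feature selection sees the validation fold.

```python
import random
import statistics

def class_means(X, y, rows, features):
    """Per-feature mean of each class, computed only from `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    mu1 = {f: statistics.fmean(X[i][f] for i in pos) for f in features}
    mu0 = {f: statistics.fmean(X[i][f] for i in neg) for f in features}
    return mu0, mu1

def select_features(X, y, rows, m):
    """Keep the m features whose class means differ most on `rows`."""
    all_features = range(len(X[0]))
    mu0, mu1 = class_means(X, y, rows, all_features)
    return sorted(all_features, key=lambda f: abs(mu1[f] - mu0[f]), reverse=True)[:m]

def predict(x, mu0, mu1, features):
    """Nearest-centroid rule restricted to `features`."""
    d0 = sum((x[f] - mu0[f]) ** 2 for f in features)
    d1 = sum((x[f] - mu1[f]) ** 2 for f in features)
    return 1 if d1 < d0 else 0

def cv_accuracy(X, y, k, m, leak):
    """k-fold accuracy; leak=True selects features on ALL rows before splitting."""
    n = len(X)
    folds = [list(range(j, n, k)) for j in range(k)]
    leaked = select_features(X, y, range(n), m) if leak else None
    correct = 0
    for val in folds:
        held_out = set(val)
        train = [i for i in range(n) if i not in held_out]
        # The honest variant repeats selection inside the fold, on train only.
        features = leaked if leak else select_features(X, y, train, m)
        mu0, mu1 = class_means(X, y, train, features)
        correct += sum(predict(X[i], mu0, mu1, features) == y[i] for i in val)
    return correct / n

# Pure noise: 40 samples, 2000 features, labels unrelated to the features.
rng = random.Random(0)
n, p = 40, 2000
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]
rng.shuffle(y)

leaky = cv_accuracy(X, y, k=5, m=20, leak=True)
honest = cv_accuracy(X, y, k=5, m=20, leak=False)
print(f"selection on all data: {leaky:.2f}   selection inside each fold: {honest:.2f}")
```

Since the labels are independent of the features, the true accuracy is 50%. The in-fold variant should land near that, while the leaky variant typically scores far higher because the selected features were chosen partly for how well they fit the validation points. In scikit-learn, wrapping the selection step and the model in a `Pipeline` and passing the pipeline to the cross-validation routine achieves the in-fold behaviour.
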
One sentence covers all three: the validation fold must contain no information that was available, in any form, when its model was built. Every sin is a violation of that sentence; the gap-based diagnostics of overfitting / underfitting are only as honest as this hygiene.
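
For the temporal case, the rule "train always strictly before validate" can be sketched as an expanding-window splitter. This is a minimal illustration, assuming the samples are already sorted oldest first; `expanding_window_splits` and its parameters are hypothetical names chosen for the example.

```python
def expanding_window_splits(n, n_splits, val_size):
    """Yield (train_indices, val_indices) pairs for time-ordered data.

    Assumes the n samples are sorted by time, oldest first. Every training
    index is strictly smaller than every validation index in the same split,
    and the training window grows from one split to the next.
    """
    first_val_start = n - n_splits * val_size
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for s in range(n_splits):
        val_start = first_val_start + s * val_size
        train = list(range(0, val_start))
        val = list(range(val_start, val_start + val_size))
        yield train, val

for train, val in expanding_window_splits(n=12, n_splits=3, val_size=2):
    print(f"train 0..{train[-1]}  validate {val[0]}..{val[-1]}")
# train 0..5  validate 6..7
# train 0..7  validate 8..9
# train 0..9  validate 10..11
```

A rolling window would instead drop the oldest training points as the window advances. scikit-learn ships `TimeSeriesSplit` for the expanding case and `GroupKFold` for repeated entities; the sketch above only shows the ordering constraint itself.
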
Mental Model