Estimating generalisation without wasting the data
To compare models or tune hyperparameters we need a number: expected error on unseen data. The naive estimator is a single train/test split. It has two defects, and they pull in opposite directions. Hold out 20% and you have wasted a fifth of your data — the model you evaluate is trained on less than the model you will ship. Hold out less and the estimate gets noisy: a small test set is one roll of the dice, and which particular points landed in it can swing the measured error by whole percentage points. With one split you are reading a single noisy sample of a random variable and calling it the mean. Decisions made on such numbers — "model A beats model B by 0.4%" — are frequently decisions about noise.
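To put a rough size on that noise, here is a minimal sketch using the binomial standard error of a measured accuracy. It assumes the test points are independent draws, and the true accuracy of 0.90 is a made-up figure for illustration; the function name `accuracy_standard_error` is likewise hypothetical.

```python
import math


def accuracy_standard_error(true_accuracy: float, n_test: int) -> float:
    """Binomial standard error of accuracy measured on n_test independent points."""
    return math.sqrt(true_accuracy * (1.0 - true_accuracy) / n_test)


# Hypothetical model with a true accuracy of 90%, evaluated on test sets of
# increasing size.
for n_test in (100, 1_000, 10_000):
    se = accuracy_standard_error(0.90, n_test)
    print(f"n_test={n_test:>6}: standard error = {100 * se:.2f} percentage points")

# n_test=   100: standard error = 3.00 percentage points
# n_test=  1000: standard error = 0.95 percentage points
# n_test= 10000: standard error = 0.30 percentage points
```

Under these assumptions, a 1,000-point test set leaves each model's measured accuracy uncertain by roughly one percentage point, so a 0.4-point gap read off a single split of that size is hard to tell apart from noise.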
The standard remedy is k-fold cross-validation (CV). Cut the data into k folds of equal (or nearly equal) size. Train k times, each time holding out a different fold for validation and training on the other k−1. Average the k scores.
Both defects are addressed at once: every point is used for training (in k−1 of the runs) and for validation (in exactly one), so nothing is wasted; and the average of k scores has lower variance than any single score, though the reduction is less than k-fold because the training sets overlap and the scores are correlated. The spread of the k scores is a bonus: a rough error bar on the estimate, which one split never gives you. It is only rough because the fold scores are not independent, so it tends to understate the true uncertainty. The price is k× the training compute, which is why k = 5 or 10 is standard for classical models and why deep learning, where one training run is expensive, mostly falls back to a single (large) validation split and accepts the noise.
With k = 5, the validation fold marches across the runs; every point validates exactly once and trains four times.
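The procedure can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the helper names `k_fold_indices` and `cross_val_scores`, the toy "predict the training mean" model, and the synthetic data are all assumptions made for the example.

```python
import random
import statistics
from typing import Any, Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) pairs; each index lands in validation exactly once."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # When k does not divide n, the first (n % k) folds get one extra point.
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


def cross_val_scores(
    xs: Sequence[Any],
    ys: Sequence[float],
    k: int,
    fit: Callable[[list, list], Any],
    score: Callable[[Any, list, list], float],
    seed: int = 0,
) -> List[float]:
    """Train k times, each time scoring on the fold that was held out."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return scores


# Toy model (hypothetical): predict the training mean, score by mean squared error.
def fit_mean(xs: list, ys: list) -> float:
    return statistics.fmean(ys)


def mse(model: float, xs: list, ys: list) -> float:
    return statistics.fmean([(y - model) ** 2 for y in ys])


rng = random.Random(1)
xs = list(range(50))
ys = [rng.gauss(0.0, 1.0) for _ in xs]  # synthetic targets, for illustration only

scores = cross_val_scores(xs, ys, k=5, fit=fit_mean, score=mse)
print(f"CV estimate: {statistics.fmean(scores):.3f} "
      f"(std across folds: {statistics.stdev(scores):.3f})")
```

Any model can be plugged in through the `fit` and `score` callables. The standard deviation across folds is printed alongside the mean because it is the rough error bar discussed above.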
| Sin | What happens | The rule |
|---|---|---|
| Preprocessing leakage | Standardisation, feature selection, whitening, or imputation fit on all data before splitting lets the validation fold's statistics leak into training. Selecting features on the full data then "validating" can make pure noise score brilliantly. | Fit every preprocessing step inside the fold, on the training portion only (pipelines exist for exactly this). |
| Tuning on the test set | Evaluate on the held-out test set repeatedly while tuning, and you have promoted it to a validation set; the final number is an optimistic fiction. | The test set is touched once, after every decision is frozen. CV is where the iteration happens. |
| Shuffling temporal data | Random folds let the model train on the future and validate on the past, like backtesting a forecaster with tomorrow's newspaper. Scores that looked good in validation collapse on deployment. | Time-series splits: expanding or rolling windows, with training data always strictly before validation data. Group-aware folds for repeated entities (same patient, same user) for the same reason: otherwise near-copies of the validation points sit in the training set. |
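
Two of these rules are mechanical enough to sketch in code: fitting preprocessing on the training portion only, and keeping training strictly before validation in time. The sketch below is a minimal illustration in plain Python; the helper names (`fit_standardiser`, `apply_standardiser`, `expanding_window_splits`) and the toy series are assumptions made for the example, not an established API.

```python
import statistics
from typing import Iterator, List, Sequence, Tuple


def fit_standardiser(train_column: Sequence[float]) -> Tuple[float, float]:
    """Learn the mean and standard deviation from the training portion only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column) or 1.0  # avoid dividing by zero on constant columns
    return mean, std


def apply_standardiser(column: Sequence[float], params: Tuple[float, float]) -> List[float]:
    """Apply already-fitted statistics; nothing is re-estimated here."""
    mean, std = params
    return [(x - mean) / std for x in column]


def expanding_window_splits(
    n: int, n_splits: int, min_train: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) with every training index before every validation index."""
    if n_splits < 1 or min_train < 1 or n - min_train < n_splits:
        raise ValueError("not enough points for the requested splits")
    # Equal-sized validation windows; leftover points at the end are dropped.
    val_size = (n - min_train) // n_splits
    for split in range(n_splits):
        train_end = min_train + split * val_size
        yield list(range(train_end)), list(range(train_end, train_end + val_size))


series = [float(t) for t in range(10)]  # hypothetical time-ordered feature

for train_idx, val_idx in expanding_window_splits(len(series), n_splits=3, min_train=4):
    # Statistics come from the training window; the validation window only reuses them.
    params = fit_standardiser([series[i] for i in train_idx])
    train_scaled = apply_standardiser([series[i] for i in train_idx], params)
    val_scaled = apply_standardiser([series[i] for i in val_idx], params)
    print(f"train 0..{train_idx[-1]}  validate {val_idx[0]}..{val_idx[-1]}")

# train 0..3  validate 4..5
# train 0..5  validate 6..7
# train 0..7  validate 8..9
```

In practice a library usually handles this bookkeeping. In scikit-learn, for instance, `Pipeline`, `TimeSeriesSplit`, and `GroupKFold` cover fold-local preprocessing, time-ordered splits, and group-aware folds respectively.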