Many imperfect models, one good answer — if they disagree usefully
Ask a thousand people to guess the weight of an ox and the average guess is eerily accurate — provided their errors point in different directions, so that averaging cancels them. The same logic applies to models: each trained model is truth plus an error term, and if the error terms are not all the same error, combining models shrinks the error that remains.
The condition in italics is the entire subject. A committee of clones is just one model with extra compute. Ensembling works exactly to the degree that the members are individually decent and mutually diverse — competent enough to be right on average, different enough to be wrong in different places.
Make it exact. Take n models whose predictions at a point each have variance σ² and pairwise correlation ρ (over redraws of training data — the variance of the bias–variance decomposition). The variance of their average is:
Read it slowly, because this one line is the whole field. The independent share of the error vanishes as you add members. The correlated share survives no matter how many models you train: with ρ = 1 the ensemble equals one model; with ρ = 0 variance falls all the way to σ²/n. Every ensemble method ever invented is a scheme for pushing ρ down without pushing individual quality (σ², and bias) up too much.
| Lever | Mechanism | Canonical method |
|---|---|---|
| Resample the data | Each model sees a different bootstrap draw, so each overfits different noise. | Bagging |
| Subsample the features | Models are forbidden from all leaning on the same dominant feature. | Random forests (bagging + per-split feature subsets) |
| Change the objective per member | Each model is trained on what the previous ones still get wrong — diversity by construction, aimed at bias rather than variance. | Boosting |
| Change the algorithm | A tree, a linear model, and a neural net have different inductive biases, hence decorrelated errors. | Heterogeneous ensembles, stacking |
| Change the randomness | Different seeds, init, augmentation, or checkpoints of one training run. | Deep ensembles, snapshot ensembles |
σ²(ρ + (1−ρ)/n) versus n. The curve you are on is set by ρ; n only walks you down to its floor.
For regression, average; for classification, vote on labels or (better) average predicted probabilities, which preserves calibrated uncertainty. Stacking goes one step further: treat member predictions as features and train a small meta-model — typically regularised logistic regression — to combine them. The one rule that matters: the meta-model must be trained on out-of-fold predictions, otherwise it learns to trust whichever member overfit hardest, and the stack overfits at the second level.
The two great families are worth keeping mentally orthogonal: bagging combines deep, low-bias, high-variance learners in parallel to cancel variance; boosting combines shallow, high-bias learners in sequence to grind down bias. Same word "ensemble," opposite diagnoses.