Ensembles

Many imperfect models, one good answer — if they disagree usefully

01 · First principles: Why should averaging models work at all?

Ask a thousand people to guess the weight of an ox and the average guess is eerily accurate — provided their errors point in different directions, so that averaging cancels them. The same logic applies to models: each trained model is truth plus an error term, and if the error terms are not all the same error, combining models shrinks the error that remains.

That proviso, errors that point in different directions, is the entire subject. A committee of clones is just one model with extra compute. Ensembling works exactly to the degree that the members are individually decent and mutually diverse: competent enough to be right on average, different enough to be wrong in different places.

02 · The equation: Variance of the mean, with correlation

Make it exact. Take n models whose predictions at a point each have variance σ² and pairwise correlation ρ (over redraws of training data — the variance of the bias–variance decomposition). The variance of their average is:

Var( (1/n) Σ f̂_i ) = ρσ² + (1 − ρ)σ² / n

First term, ρσ²: the correlated part; averaging cannot touch it.
Second term, (1 − ρ)σ² / n: the independent part; it dies as 1/n.

Read it slowly, because this one line is the whole field. The independent share of the error vanishes as you add members. The correlated share survives no matter how many models you train: with ρ = 1 the ensemble has exactly the variance of one model; with ρ = 0 variance falls all the way to σ²/n. Every ensemble method ever invented is a scheme for pushing ρ down without pushing individual error (σ², and bias) up too much.
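For readers who want to see where the formula comes from, here is a short derivation. It assumes only what the text states: every member has variance σ² and every pair of members has correlation ρ, so each pairwise covariance is ρσ².

```latex
\[
\begin{aligned}
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n}\hat f_i\right)
  &= \frac{1}{n^{2}}\left[\sum_{i=1}^{n}\operatorname{Var}(\hat f_i)
     + \sum_{i \neq j}\operatorname{Cov}(\hat f_i,\hat f_j)\right] \\
  &= \frac{1}{n^{2}}\left[n\sigma^{2} + n(n-1)\rho\sigma^{2}\right] \\
  &= \frac{\sigma^{2}}{n} + \frac{n-1}{n}\,\rho\sigma^{2} \\
  &= \rho\sigma^{2} + \frac{(1-\rho)\,\sigma^{2}}{n}.
\end{aligned}
\]
```

The n variance terms are diluted by the 1/n² factor, while the n(n − 1) covariance terms grow almost as fast as n², which is why the ρσ² term does not shrink.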

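Consequence: the gap above the floor is (1 − ρ)σ²/n, so once n is well beyond (1 − ρ)/ρ, adding more identical-recipe models does almost nothing. Unless ρ is very small, a few hundred members have effectively reached the ρσ² floor. To improve further you must diversify differently, not multiply harder.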
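The floor can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the simulation assumes Gaussian errors with unit variance built from one shared component plus one private component per member, which is just one convenient way to get pairwise correlation ρ.

```python
import math
import random
import statistics


def ensemble_variance(sigma2, rho, n):
    """Closed form from the text: rho*sigma^2 + (1 - rho)*sigma^2 / n."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n


def simulate_mean_variance(rho, n, trials=20000, seed=0):
    """Monte Carlo estimate of Var(mean of n unit-variance errors with correlation rho).

    Assumed construction: error_i = sqrt(rho) * shared + sqrt(1 - rho) * private_i,
    which gives variance 1 per member and covariance rho per pair.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)
        errors = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n)
        ]
        means.append(statistics.fmean(errors))
    return statistics.pvariance(means)


# How quickly does the closed form approach the floor rho * sigma^2?
for n in (1, 10, 100, 1000):
    print(n, round(ensemble_variance(1.0, 0.3, n), 4))

simulated = simulate_mean_variance(rho=0.3, n=10)
print("simulated:", round(simulated, 3), "formula:", ensemble_variance(1.0, 0.3, 10))
```

With ρ = 0.3 the threshold (1 − ρ)/ρ is about 2.3, so the printed values should already sit close to the 0.3 floor by n = 100.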

03 · Sources of diversity: Where low ρ comes from

Lever	Mechanism	Canonical method
Resample the data	Each model sees a different bootstrap draw, so each overfits different noise.	Bagging
Subsample the features	Models are forbidden from all leaning on the same dominant feature.	Random forests (bagging + per-split feature subsets)
Change the objective per member	Each model is trained on what the previous ones still get wrong — diversity by construction, aimed at bias rather than variance.	Boosting
Change the algorithm	A tree, a linear model, and a neural net have different inductive biases, hence decorrelated errors.	Heterogeneous ensembles, stacking
Change the randomness	Different seeds, init, augmentation, or checkpoints of one training run.	Deep ensembles, snapshot ensembles

Figure: σ²(ρ + (1−ρ)/n) versus n. The curve you are on is set by ρ; n only walks you down to its floor.
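The first lever in the table, resampling the data, can be sketched in a few lines. This is a toy illustration under stated assumptions rather than a reference implementation: the base learner is a 1-nearest-neighbour regressor (chosen because it is low-bias and high-variance), the data are a noisy sine curve, and every function name here is hypothetical.

```python
import math
import random
import statistics


def fit_1nn(xs, ys):
    """Deliberately high-variance base learner: 1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))

    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return predict


def fit_bagged(xs, ys, n_members, rng):
    """Bagging: fit each member on a bootstrap draw, then average the predictions."""
    m = len(xs)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(m) for _ in range(m)]  # m rows drawn with replacement
        members.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        return statistics.fmean(f(x) for f in members)

    return predict


def draw_training_set(rng, m=40):
    """Assumed toy problem: y = sin(2*pi*x) + Gaussian noise."""
    xs = [rng.random() for _ in range(m)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def prediction_variance(fit, rng, x0=0.3, redraws=200):
    """Variance of the prediction at x0 over redraws of the training data."""
    preds = []
    for _ in range(redraws):
        xs, ys = draw_training_set(rng)
        preds.append(fit(xs, ys)(x0))
    return statistics.pvariance(preds)


rng = random.Random(0)
var_single = prediction_variance(fit_1nn, rng)
var_bagged = prediction_variance(lambda xs, ys: fit_bagged(xs, ys, 25, rng), rng)
print("single 1-NN variance:", round(var_single, 3))
print("bagged 1-NN variance:", round(var_bagged, 3))
```

The variance here is measured the way the equation defines it, over redraws of the training set, so the comparison speaks directly to the σ² and ρ in the formula.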

04 · Combining: Vote, average, or learn the combination

For regression, average; for classification, vote on labels or (better) average predicted probabilities, which keeps each member's confidence instead of collapsing it to a hard label. Stacking goes one step further: treat member predictions as features and train a small meta-model (typically regularised logistic regression) to combine them. The one rule that matters: the meta-model must be trained on out-of-fold predictions, that is, predictions each member makes on data it was not fitted on; otherwise it learns to trust whichever member overfit hardest, and the stack overfits at the second level.
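To make the out-of-fold rule concrete, the sketch below generates the meta-model's training features for two toy members and compares them with in-sample predictions. Everything in it is an assumption for illustration: the helper names are hypothetical, the folds are interleaved by index, the labels are pure noise (so there is nothing real to learn), and the meta-model itself is left out.

```python
import random
import statistics


def fit_memorizer(xs, ys):
    """Member that overfits as hard as possible: 1-nearest-neighbour lookup."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_mean(xs, ys):
    """Member that ignores x and predicts the training mean."""
    mu = statistics.fmean(ys)
    return lambda x: mu


def out_of_fold_predictions(fit, xs, ys, k=5):
    """Predict each point with a model fitted on the other k - 1 folds only."""
    n = len(xs)
    oof = [0.0] * n
    for j in range(k):
        train = [i for i in range(n) if i % k != j]  # fold j is held out
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(j, n, k):
            oof[i] = model(xs[i])
    return oof


def mse(preds, ys):
    return statistics.fmean((p - y) ** 2 for p, y in zip(preds, ys))


rng = random.Random(0)
n = 400
xs = [i / n for i in range(n)]
ys = [rng.gauss(0.0, 1.0) for _ in range(n)]  # pure noise labels

memorizer = fit_memorizer(xs, ys)
in_sample_mse = mse([memorizer(x) for x in xs], ys)
oof_mse_memorizer = mse(out_of_fold_predictions(fit_memorizer, xs, ys), ys)
oof_mse_mean = mse(out_of_fold_predictions(fit_mean, xs, ys), ys)

print("memorizer, in-sample MSE:  ", in_sample_mse)
print("memorizer, out-of-fold MSE:", round(oof_mse_memorizer, 3))
print("mean model, out-of-fold MSE:", round(oof_mse_mean, 3))
```

In-sample, the memorizer looks perfect, so a meta-model fed those predictions would put all its weight on it. The out-of-fold columns give the meta-model an honest picture of each member instead.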

The two great families are worth keeping mentally orthogonal: bagging combines deep, low-bias, high-variance learners in parallel to cancel variance; boosting combines shallow, high-bias learners in sequence to grind down bias. Same word "ensemble," opposite diagnoses.
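The sequential half of that contrast can be sketched as boosting on residuals with one-split stumps. This is a minimal illustration assuming squared-error loss, 1-D inputs, and a fixed learning rate; the function names and the noiseless sine target are hypothetical choices, not a description of any particular library.

```python
import math
import statistics


def fit_stump(xs, targets):
    """Shallow, high-bias learner: the best single split under squared error."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = statistics.fmean(ys)
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(statistics.fmean((y - p) ** 2 for y, p in zip(ys, preds)))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


xs = [i / 50 for i in range(50)]
ys = [math.sin(2 * math.pi * x) for x in xs]  # noiseless target: error here is bias
predict, history = boost(xs, ys)
print("training MSE after round 1: ", round(history[0], 4))
print("training MSE after round 20:", round(history[-1], 4))
```

Because the target has no noise, the falling training error is bias being removed round by round, which is the opposite job from the variance cancelling that bagging does.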

05 · The bill: What ensembling costs

Compute and latency: n models means roughly n× training and n× inference. In low-latency serving this is often the deciding argument against; distillation (training one model to mimic the ensemble) can recover much of the accuracy at single-model inference cost.
Interpretability: one decision tree is an explanation; five hundred of them are not. You trade the readable rule for a feature-importance histogram.
Diminishing returns: the ρσ² floor again. Ensembling is the most reliable few-percent improvement in ML, and almost never more than that.

Mental Model

Ensemble variance = σ²(ρ + (1−ρ)/n): the independent part of the error dies as 1/n, the correlated part never dies.
Therefore diversity (low ρ) is everything; member count only walks you down to the ρσ² floor.
Diversity levers: resample data (bagging), subsample features (forests), retarget objectives (boosting), change algorithms (heterogeneous ensembles, stacking), change the randomness (deep and snapshot ensembles).
Bagging is parallel variance reduction; boosting is sequential bias reduction — opposite cures for opposite diseases.
The price is compute and interpretability; distill when serving costs bite.