General ML

Bagging

Manufacture many datasets, average away the wobble

01 · First principles · The problem: one dataset, one wobble

Some learners are good on average but unstable: train a fully grown decision tree on your data, then on a slightly perturbed copy, and you get two very different trees. In bias–variance language they have low bias and high variance — the average over hypothetical retrainings is accurate, but you only ever hold one draw from that distribution.

If we could train on many independent datasets and average the resulting models, the wobble would cancel and the accurate average would emerge. The problem forcing bagging to exist: we have exactly one dataset.

02 · The trick · Bootstrap: pseudo-datasets from one dataset

The bootstrap manufactures the missing datasets. Resample n points from your n-point dataset with replacement: some points appear twice or thrice, others not at all. Each resample is a plausible alternative dataset drawn from approximately the same distribution. Bootstrap aggregating:

  1. Draw B bootstrap samples D1 … DB from the training set.
  2. Train one high-variance model f̂b on each (deep trees, no pruning — keep the bias low and let the variance run).
  3. Predict by averaging (regression) or majority vote / averaged probabilities (classification).

f̂bag(x) = (1/B) Σ_{b=1..B} f̂b(x)

Each point is omitted from a given resample with probability (1 − 1/n)ⁿ → e⁻¹ ≈ 0.37 as n grows.

[Figure: dataset D (n points) → resamples D₁, D₂, …, D_B → smooth mean. B jagged fits → one calm average.]

Each bootstrap model overfits its own resample's noise; the noises differ, so the average keeps the signal.
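The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the 1-D regression tree `fit_tree`, the helper names, the sine-plus-noise data, and the settings (40 points, 25 models) are all invented for the example.

```python
import math
import random


def sse(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_tree(points):
    """Grow a 1-D regression tree until each leaf holds one distinct x.

    points is a list of (x, y) pairs; returns a function x -> prediction.
    No pruning, so the tree is deliberately low-bias and high-variance.
    """
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:  # nothing left to split: predict the mean y
        leaf = sum(y for _, y in points) / len(points)
        return lambda x: leaf
    best_cost, best_t = math.inf, None
    for lo, hi in zip(xs, xs[1:]):  # candidate thresholds between neighbours
        t = (lo + hi) / 2
        left_ys = [y for x, y in points if x <= t]
        right_ys = [y for x, y in points if x > t]
        cost = sse(left_ys) + sse(right_ys)
        if cost < best_cost:
            best_cost, best_t = cost, t
    left = fit_tree([p for p in points if p[0] <= best_t])
    right = fit_tree([p for p in points if p[0] > best_t])
    return lambda x: left(x) if x <= best_t else right(x)


def bagged_fit(points, n_models, rng):
    """Steps 1 and 2: one tree per bootstrap resample."""
    models = []
    for _ in range(n_models):
        resample = [rng.choice(points) for _ in points]  # n draws, with replacement
        models.append(fit_tree(resample))
    return models


def bagged_predict(models, x):
    """Step 3 (regression): average the individual predictions."""
    return sum(m(x) for m in models) / len(models)


def truth(x):
    """Hypothetical ground truth used only to score the fits."""
    return math.sin(2 * math.pi * x)


rng = random.Random(0)
train = [(x, truth(x) + rng.gauss(0, 0.3)) for x in (rng.random() for _ in range(40))]
models = bagged_fit(train, n_models=25, rng=rng)

grid = [i / 200 for i in range(201)]
single_mse = sum((m(x) - truth(x)) ** 2 for m in models for x in grid) / (
    len(models) * len(grid)
)
bagged_mse = sum((bagged_predict(models, x) - truth(x)) ** 2 for x in grid) / len(grid)
print(f"average single-tree MSE: {single_mse:.3f}")
print(f"bagged MSE:              {bagged_mse:.3f}")
```

Because squared error is convex, the averaged prediction can never score worse than the average of the individual trees' scores on the same points; how much better it does depends on how different the trees are.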

03 · What it buys · Pure variance reduction, bias untouched

Averaging does not move the centre of the distribution it averages: E[f̂bag] is essentially E[f̂] (bootstrap draws are a slightly degraded stand-in for fresh data, but to first order the bias is unchanged). What averaging does move is the spread. By the variance-of-the-mean identity (derived properly in ensembles):

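Var(f̂bag) = ρσ² + (1 − ρ)σ²/B

Here σ² is the variance of a single model's prediction and ρ is the correlation between the predictions of two models in the ensemble. The first term, ρσ², is shared overfitting, which survives averaging; the second, (1 − ρ)σ²/B, is independent overfitting, which dies as B grows.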
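The identity can be checked numerically. The sketch below treats each model's prediction as a Gaussian random variable with variance σ² and pairwise correlation ρ, which is an idealisation chosen for the example rather than a claim about real trees; the function names and the settings (ρ = 0.3, σ² = 1, B = 25) are hypothetical.

```python
import random


def ensemble_variance(rho, sigma2, n_models):
    """Variance of the mean of n_models equally correlated predictions."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_models


def simulate_ensemble_variance(rho, sigma2, n_models, n_trials, rng):
    """Monte Carlo estimate of the same quantity."""
    sigma = sigma2 ** 0.5
    means = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # noise component common to every model
        preds = [
            sigma * (rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1))
            for _ in range(n_models)
        ]
        means.append(sum(preds) / n_models)
    centre = sum(means) / n_trials
    return sum((m - centre) ** 2 for m in means) / n_trials


rng = random.Random(0)
simulated = simulate_ensemble_variance(0.3, 1.0, 25, 20_000, rng)
print(f"formula:   {ensemble_variance(0.3, 1.0, 25):.3f}")
print(f"simulated: {simulated:.3f}")

# The rho * sigma2 floor: adding models shrinks only the second term.
for n_models in (1, 10, 100, 1000):
    print(n_models, round(ensemble_variance(0.3, 1.0, n_models), 4))
```

With ρ = 0.3 the variance cannot fall below 0.3σ² however large B gets, which is the floor the formula describes.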

Two consequences fall out. First, bagging only helps unstable learners: bagging a linear regression (low σ², high ρ across resamples) achieves nearly nothing, while bagging deep trees is transformative. Second, the choice of base learner is deliberate: use low-bias, high-variance members, because bagging treats only variance and leaves bias where it was. Boosting is the mirror-image bet, aimed at bias.
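The first consequence is easy to see on a stable learner. The sketch below bags an ordinary least-squares line and compares it with a single fit on the full data; the data-generating line, noise level, and sample sizes are assumptions made up for the illustration.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b


rng = random.Random(0)
points = [(x, 1 + 2 * x + rng.gauss(0, 0.5)) for x in (rng.random() for _ in range(100))]

# One fit on the full data versus 50 fits on bootstrap resamples.
a_full, b_full = fit_line(points)
fits = [fit_line([rng.choice(points) for _ in points]) for _ in range(50)]

# Averaging the predictions of lines is the same as averaging their coefficients.
a_bag = sum(a for a, _ in fits) / len(fits)
b_bag = sum(b for _, b in fits) / len(fits)

grid = [i / 100 for i in range(101)]
max_gap = max(abs((a_bag + b_bag * x) - (a_full + b_full * x)) for x in grid)
print(f"single fit: a={a_full:.3f}, b={b_full:.3f}")
print(f"bagged fit: a={a_bag:.3f}, b={b_bag:.3f}")
print(f"largest prediction gap on [0, 1]: {max_gap:.4f}")
```

The bagged line sits almost on top of the single fit, so the extra 50 fits buy essentially nothing here.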

Free lunch, small but real: each model never saw ~37% of the data — its out-of-bag points. Score every training point using only the models that did not train on it and you get an honest generalisation estimate with no held-out set and no extra training. In practice OOB error tracks cross-validation closely.
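A minimal sketch of the out-of-bag bookkeeping follows. For brevity it swaps in a 1-nearest-neighbour regressor as the high-variance base learner instead of a deep tree; that choice, the function names, and the synthetic data are assumptions of the example, not part of the method.

```python
import math
import random


def fit_1nn(points):
    """Stand-in high-variance learner: predict the y of the nearest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def bag_with_oob(points, n_models, rng):
    """Train on bootstrap resamples and remember which indices each model saw."""
    n = len(points)
    models, in_bag = [], []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_1nn([points[i] for i in idx]))
        in_bag.append(set(idx))
    return models, in_bag


def oob_mse(points, models, in_bag):
    """Score each training point using only the models that never saw it."""
    sq_errs = []
    for i, (x, y) in enumerate(points):
        votes = [m(x) for m, bag in zip(models, in_bag) if i not in bag]
        if votes:  # skip a point if every model happened to train on it
            sq_errs.append((sum(votes) / len(votes) - y) ** 2)
    return sum(sq_errs) / len(sq_errs)


def noisy_sine(rng):
    """Hypothetical data source: one (x, y) draw from a sine plus noise."""
    x = rng.random()
    return x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)


rng = random.Random(1)
points = [noisy_sine(rng) for _ in range(150)]
models, in_bag = bag_with_oob(points, n_models=40, rng=rng)

oob_fraction = 1 - sum(len(bag) for bag in in_bag) / (len(in_bag) * len(points))
oob_error = oob_mse(points, models, in_bag)

# Fresh draws from the same source, used only as a reference point.
fresh = [noisy_sine(rng) for _ in range(150)]
fresh_error = sum(
    (sum(m(x) for m in models) / len(models) - y) ** 2 for x, y in fresh
) / len(fresh)

print(f"out-of-bag fraction per model: {oob_fraction:.3f}")
print(f"OOB MSE:   {oob_error:.3f}")
print(f"fresh MSE: {fresh_error:.3f}")
```

The out-of-bag fraction should land near 0.37. Each OOB prediction averages only the models that missed that point, roughly a third of the ensemble, so the estimate describes a somewhat smaller ensemble than the one scored on fresh data.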

04 · The upgrade · Random forests: decorrelate the trees

Bagged trees share a defect: they are all built from the same features by the same greedy criterion. If one feature is strongly predictive, every tree splits on it at the root, and the trees come out similar — ρ stays high, and the ρσ² term puts a floor under the ensemble that more trees cannot break.

Random forests attack ρ directly: at each split, the tree may only consider a random subset of the d features (√d is the classification default). The dominant feature is unavailable for many splits, so different trees are forced to discover different structure, and their errors decorrelate. Each individual tree gets slightly worse (σ² up a little, bias up a little); the ensemble gets better, because the drop in ρ lowers σ²(ρ + (1−ρ)/B) by more than the small rises in σ² and bias cost.
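The effect on the root split can be illustrated directly. The sketch below looks only at the first split of each tree, on synthetic data where feature 0 dominates, and counts how often that feature is chosen with and without per-split feature subsetting. The data, the helper names, and the sizes (9 features, 40 rows, 100 trees) are hypothetical choices for the example.

```python
import math
import random


def sse(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_root_feature(rows, ys, candidates):
    """Greedy root split: the candidate feature with the lowest squared error."""
    best_cost, best_j = math.inf, None
    for j in candidates:
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, ys) if r[j] <= t]
            right = [y for r, y in zip(rows, ys) if r[j] > t]
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best_cost, best_j = cost, j
    return best_j


def root_usage(rows, ys, n_trees, subset_size, rng, feature=0):
    """Fraction of bootstrap trees whose root splits on `feature`."""
    n, d = len(rows), len(rows[0])
    hits = 0
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        candidates = rng.sample(range(d), subset_size)  # features allowed at this split
        root = best_root_feature(
            [rows[i] for i in idx], [ys[i] for i in idx], candidates
        )
        hits += root == feature
    return hits / n_trees


rng = random.Random(0)
d = 9
rows = [[rng.random() for _ in range(d)] for _ in range(40)]
# Feature 0 is strongly predictive; the others contribute a little each.
ys = [4 * r[0] + 0.5 * sum(r[1:]) + rng.gauss(0, 0.3) for r in rows]

bagging_share = root_usage(rows, ys, 100, subset_size=d, rng=rng)
forest_share = root_usage(rows, ys, 100, subset_size=round(math.sqrt(d)), rng=rng)
print(f"root splits on feature 0, all features allowed: {bagging_share:.2f}")
print(f"root splits on feature 0, sqrt(d) features:     {forest_share:.2f}")
```

With all 9 features available, nearly every root picks feature 0. With 3 candidates per split, feature 0 is on offer only about a third of the time, so the remaining trees must start from something else.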

Method            Diversity source                         What falls
Single deep tree  —                                        nothing (high variance)
Bagging           bootstrap resampling                     (1−ρ)σ²/B term
Random forest     resampling + per-split feature subsets   the ρσ² floor itself

05 · In practice · Knobs and habits

Mental Model