General ML

Regularisation Methods

Buying bias to cut variance, in four currencies

01 · First principles

One purpose, many disguises

A flexible model trained on a finite sample will spend part of its capacity fitting the noise in that particular sample; this is the variance term of the bias–variance decomposition. Regularisation is any device that restrains how freely the model can chase its training set. Every method in this note, however different it looks, makes the same transaction: accept a little systematic error (bias) to suppress sensitivity to the sample (variance). No entry in the taxonomy escapes paying; the craft is paying in the currency your problem feels least.

02 · Penalties

Charge the weights rent

Add a term to the loss that charges for parameter magnitude: L(θ) + λΩ(θ). The two classic rents behave very differently at zero.

L2: Ω(θ) = ‖θ‖₂²  →  gradient 2λθ  →  proportional shrinkage
L1: Ω(θ) = ‖θ‖₁  →  gradient λ·sign(θ) (for θ ≠ 0)  →  exact zeros
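
To make the two behaviours at zero concrete, here is a minimal sketch assuming a one-dimensional problem with a quadratic data term ½(θ − z)², where z stands for a hypothetical unregularised weight. Under that assumption both penalties have closed-form minimisers: the squared penalty rescales z, and the absolute-value penalty soft-thresholds it. The function names are illustrative and not taken from any library.

```python
import numpy as np

def l2_shrink(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + lam*theta**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)

def l1_shrink(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + lam*|theta|: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.05, 0.4, 3.0])  # hypothetical unregularised weights
lam = 0.5

theta_l2 = l2_shrink(z, lam)  # every entry scaled by the same factor, none exactly zero
theta_l1 = l1_shrink(z, lam)  # entries with |z| <= lam become exactly zero
print(theta_l2)
print(theta_l1)
```

The L2 result keeps all five weights, only smaller; the L1 result keeps only the weights whose magnitude exceeds λ.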

L2's pull is proportional to the weight, so it shrinks everything smoothly toward zero but never reaches it — small, distributed weights, smoother functions. L1's pull is constant regardless of size, so small weights get dragged exactly to zero and stay there: sparsity, free feature selection. The Bayesian reading makes the bias explicit: the penalty is a prior, and the regularised solution is the MAP estimate. L2 says "I believe weights are small" (Gaussian prior); L1 says "I believe most weights are exactly irrelevant" (Laplace prior). You are injecting a belief; that belief is the bias you bought. (For the optimiser-interaction fine print — L2 inside Adam is not weight decay — see AdamW.)
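
The MAP reading can be written out in a few lines. This is a sketch under stated assumptions: L(θ) is taken to be a negative log-likelihood, the prior is independent and zero-mean across weights, and σ and b are hypothetical scale parameters that the text does not name. Constants that do not depend on θ are dropped.

```latex
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta}\; p(\mathcal{D}\mid\theta)\,p(\theta)
   = \arg\min_{\theta}\; \underbrace{-\log p(\mathcal{D}\mid\theta)}_{L(\theta)} \;-\; \log p(\theta) \\[4pt]
\text{Gaussian: } p(\theta) \propto \exp\!\Big(-\tfrac{\lVert\theta\rVert_2^2}{2\sigma^2}\Big)
  &\;\Rightarrow\; -\log p(\theta) = \tfrac{1}{2\sigma^2}\lVert\theta\rVert_2^2 + \text{const}
  \quad \big(\lambda = \tfrac{1}{2\sigma^2},\ \Omega = \lVert\theta\rVert_2^2\big) \\[4pt]
\text{Laplace: } p(\theta) \propto \exp\!\Big(-\tfrac{\lVert\theta\rVert_1}{b}\Big)
  &\;\Rightarrow\; -\log p(\theta) = \tfrac{1}{b}\lVert\theta\rVert_1 + \text{const}
  \quad \big(\lambda = \tfrac{1}{b},\ \Omega = \lVert\theta\rVert_1\big)
\end{aligned}
```

Read this way, a tighter prior (smaller σ or b) is the same thing as a larger λ: a stronger belief costs more bias.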

03 · Noise

Make memorisation a losing game

The second family corrupts the training signal so that fitting any one sample's quirks stops paying.
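
Two of the noise methods listed in the taxonomy table, dropout and label smoothing, fit in a few lines each. The following is a minimal NumPy sketch, not a framework implementation: it assumes the common "inverted" form of dropout (survivors rescaled by 1/keep_prob) and uniform smoothing over classes, and the names `dropout`, `smooth_labels`, `keep_prob` and `eps` are illustrative.

```python
import numpy as np

def dropout(x, keep_prob, rng, training=True):
    """Inverted dropout: zero units at random, rescale survivors by 1/keep_prob."""
    if not training:
        return x  # no noise at evaluation time
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def smooth_labels(labels, num_classes, eps):
    """Label smoothing: mix one-hot targets with a uniform distribution."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

rng = np.random.default_rng(0)
activations = np.ones((4, 8))  # hypothetical hidden activations
noisy = dropout(activations, keep_prob=0.8, rng=rng)
targets = smooth_labels(np.array([0, 2]), num_classes=3, eps=0.1)
print(noisy)
print(targets)
```

In both cases the corruption is applied only during training: dropout changes which units are present on each pass, and smoothed targets keep the loss from rewarding ever-larger confidence on a single label.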

04 · Architecture & accident

Built-in and implicit regularisers

Architectural: constraints baked into the model family itself. Weight sharing in CNNs is the canonical case: declaring that the same filter applies at every spatial position collapses millions of free parameters into thousands, a hard prior of translation invariance (strictly, the convolution is translation-equivariant: shift the input and the feature map shifts with it; pooling supplies the invariance). Hard priors are the strongest regularisers available, and the most biased: they are unbeatable when true (images) and crippling when false.
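
The parameter collapse is easy to check with arithmetic. The sketch below assumes a hypothetical layer (a 32×32 RGB image mapped to 16 feature maps of the same size) and compares a fully connected layer with a 3×3 convolution. It also checks, for a one-dimensional filter with circular boundaries, that sharing one filter across positions makes the output shift along with the input. All names and sizes are illustrative.

```python
import numpy as np

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution; independent of image size."""
    return k * k * c_in * c_out + c_out

# Hypothetical layer: 32x32 RGB image in, 16 feature maps of the same size out.
n_dense = dense_params(32 * 32 * 3, 32 * 32 * 16)
n_conv = conv2d_params(c_in=3, c_out=16, k=3)
print(n_dense, n_conv)

def circular_filter(x, w):
    """Apply one shared filter w at every position of x (circular boundary)."""
    return sum(w[j] * np.roll(x, -j) for j in range(len(w)))

x = np.arange(8, dtype=float)
w = np.array([1.0, -2.0, 0.5])
shift = 3
# Filtering a shifted signal equals shifting the filtered signal.
equivariant = np.allclose(
    circular_filter(np.roll(x, shift), w),
    np.roll(circular_filter(x, w), shift),
)
print(equivariant)
```

The convolution's parameter count does not grow with image size at all, which is the hard prior at work: the layer has no way to learn a position-specific rule even when the data would reward one.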

Implicit: regularisation nobody wrote down. Early stopping halts the optimiser before it can travel far enough from initialisation to fit the noise; for linear least squares trained by gradient descent it acts approximately like an L2 penalty whose strength falls as training time grows. The noise in SGD itself is widely thought to bias training toward flat minima, which tolerate the shift between training data and reality. Much of deep learning's generalisation appears to come from this unbilled category, which is part of why heavily overparameterised nets defy the naive capacity story.
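
The link between early stopping and L2 can be made visible for linear least squares. The sketch below assumes gradient descent started from zero on synthetic data: along each singular direction of X, both early stopping and ridge regression recover only a fraction of the unregularised solution, and the two sets of fractions behave alike. The pairing λ ≈ 1/(η·steps) is a rough rule of thumb rather than an exact identity, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # hypothetical design matrix
y = rng.normal(size=50)        # hypothetical targets

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eta = 1.0 / s.max() ** 2       # step size small enough for stable descent

def gd_from_zero(steps):
    """Gradient descent on 0.5*||X theta - y||^2, started at theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= eta * X.T @ (X @ theta - y)
    return theta

def early_stop_filter(steps):
    """Fraction of each least-squares component recovered after `steps` steps."""
    return 1.0 - (1.0 - eta * s**2) ** steps

def ridge_filter(lam):
    """Same fraction for the minimiser of 0.5*||X theta - y||^2 + 0.5*lam*||theta||^2."""
    return s**2 / (s**2 + lam)

def from_filter(f):
    """Rebuild a weight vector from per-direction filter factors."""
    return Vt.T @ (f / s * (U.T @ y))

steps = 20
theta_gd = gd_from_zero(steps)
theta_spectral = from_filter(early_stop_filter(steps))        # matches theta_gd
theta_ridge = from_filter(ridge_filter(lam=1.0 / (eta * steps)))

print(early_stop_filter(steps))
print(ridge_filter(1.0 / (eta * steps)))
```

Both filters leave high-variance directions (large singular values) almost untouched and damp the low-variance ones, which is the sense in which stopping early "bounds the distance from init".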

05 · The taxonomy

One table

Family | Method | Mechanism | The bias you buy
Penalty | L2 / weight decay | shrink all weights (Gaussian prior) | smoother functions, small weights
Penalty | L1 | drive weights to exact zero (Laplace prior) | sparsity: most features assumed irrelevant
Noise | dropout | random subnetworks → implicit ensemble | no co-adapted features allowed
Noise | data augmentation | train on label-preserving transforms | the declared invariances
Noise | label smoothing | soften one-hot targets | capped confidence
Architectural | weight sharing (CNNs) | same filter everywhere (hard prior) | translation invariance, true or not
Implicit | early stopping | bound distance from init (≈ L2) | solutions near the start preferred
Implicit | SGD noise | kicked out of sharp minima | flat-minima preference

Reading the last column: every row names the bias purchased. If a method seems to reduce variance for free, you have not found its bias yet. Diagnosis of whether you even need the purchase is the subject of overfitting / underfitting.