Trust the data alone, or hedge with a prior — and why regularisers are priors
You have data D and a model family p(D | θ). The maximum-likelihood principle is the most natural rule available: choose the parameters under which what you actually observed was most probable. For independent observations x₁, …, xₙ:
θ̂_MLE = argmax_θ p(D | θ) = argmax_θ Σᵢ log p(xᵢ | θ)
We take the log for two unglamorous reasons: independent likelihoods are products, and logs turn products into sums (differentiable term by term, parallelisable, the form every gradient framework expects); and a product of a million numbers below 1 underflows any float, while its log is a perfectly tame sum. The argmax is unchanged because log is monotone.
MLE has excellent asymptotic manners: under standard regularity conditions it is consistent and asymptotically efficient, and maximising the likelihood is equivalent to minimising KL(data ∥ model), which is why cross-entropy training is MLE. Its failure mode is not asymptotic.
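The KL claim can be spelled out in one step. The derivation below is a sketch for the discrete case, where the notation \hat p_{\text{data}} (the empirical distribution that puts mass 1/n on each observation) and H (its entropy) are introduced here for illustration.

```latex
\begin{align*}
\mathrm{KL}(\hat p_{\text{data}} \,\|\, p_\theta)
  &= \sum_x \hat p_{\text{data}}(x) \log \frac{\hat p_{\text{data}}(x)}{p_\theta(x)} \\
  &= -H(\hat p_{\text{data}}) \;-\; \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i)
\end{align*}
```

The entropy term does not depend on θ, so minimising the KL divergence over θ is the same as maximising the average log-likelihood, and the second term on its own is the cross-entropy loss.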
Flip a coin three times, observe HHH. The likelihood of heads-probability p is p³, which is increasing on [0, 1] and therefore maximised at the boundary:
p̂_MLE = 1
MLE believes thin data with total conviction. It assigns probability zero to anything unseen (the smoothing problem in language models is this exact failure), and its variance explodes precisely when data is scarce — the regime where you most need an estimator you can trust. Nothing in the machinery represents the thought "three flips is not much evidence".
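The zero-probability failure can be reproduced in a few lines. This is a minimal sketch using count-based estimation for a categorical outcome; the helper name `mle_categorical` and the "future" sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter


def mle_categorical(observations, outcomes):
    """Maximum-likelihood estimate for a categorical variable: relative frequencies."""
    counts = Counter(observations)
    n = len(observations)
    return {outcome: counts[outcome] / n for outcome in outcomes}


flips = ["H", "H", "H"]
estimate = mle_categorical(flips, outcomes=["H", "T"])
print(estimate)  # {'H': 1.0, 'T': 0.0}

# Any future sequence that contains a single tail gets probability zero.
future = ["H", "T", "H"]
prob = 1.0
for flip in future:
    prob *= estimate[flip]
print(prob)  # 0.0
```

A single unseen outcome is enough to zero out the whole sequence, which is the same mechanism that makes unsmoothed n-gram language models assign zero probability to sentences containing an unseen word pair.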
Bring in Bayes: treat θ as uncertain, give it a prior p(θ), and maximise the posterior p(θ | D) = p(D | θ) p(θ) / p(D) instead. Since the evidence p(D) does not depend on θ:
θ̂_MAP = argmax_θ [ log p(D | θ) + log p(θ) ]
One added term. For the coin, a mild Beta(2,2) prior ("coins are usually fair-ish") moves the estimate from 1.0 to 4/5 — still leaning heads, no longer certain. The prior is a rubber band anchored at your prior belief; the likelihood stretches it toward the data, and three flips do not stretch it far.
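The 4/5 figure follows from Beta-Bernoulli conjugacy: a Beta(a, b) prior combined with h heads and t tails gives a Beta(a + h, b + t) posterior, whose mode is (a + h − 1) / (a + b + h + t − 2). A small sketch, with the function name `beta_bernoulli_map` chosen here for illustration:

```python
def beta_bernoulli_map(heads, tails, a, b):
    """Mode of the Beta(a + heads, b + tails) posterior over the heads-probability.

    The interior-mode formula is only used when both posterior parameters exceed 1.
    """
    post_a = a + heads
    post_b = b + tails
    if post_a <= 1 or post_b <= 1:
        raise ValueError("posterior mode is not interior for these parameters")
    return (post_a - 1) / (post_a + post_b - 2)


heads, tails = 3, 0
mle = heads / (heads + tails)
map_estimate = beta_bernoulli_map(heads, tails, a=2, b=2)

print(mle)           # 1.0
print(map_estimate)  # 0.8
```

Read as pseudo-counts, the Beta(2,2) prior behaves like one extra head and one extra tail observed before the experiment started.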
Three heads: the likelihood (red) peaks at the boundary; the prior (green) pulls the posterior (blue) back inside.
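Take the MAP objective (log-likelihood plus log-prior) and plug in a Gaussian prior over weights, p(w) = N(0, σ²I):
log p(w) = −‖w‖₂² / (2σ²) + const
That is an L2 penalty (ridge, weight decay) with strength λ = 1/(2σ²).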
A Laplace prior, p(w) ∝ exp(−‖w‖₁/b), gives log p(w) = −‖w‖₁/b + const, which is the L1 (lasso) penalty; its sharp peak at zero is why it produces exact sparsity. So the regularisation hyperparameter you tune by grid search encodes the width of a belief: λ is proportional to 1/σ² for the Gaussian prior and to 1/b for the Laplace prior, so small λ = a loose prior ("weights may be anything") and large λ = a tight one ("weights are almost surely small"). Weight decay is not a hack bolted onto the loss; it is Bayesian inference with the posterior collapsed to its peak.
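The correspondence between a Gaussian prior and an L2 penalty can be checked on a toy problem. The sketch below assumes a one-parameter linear model y = w·x with Gaussian noise; the data, `noise_var`, and `prior_var` are made-up illustrative values. Multiplying the negative log posterior by 2·noise_var gives the usual ridge objective Σ(y − w·x)² + λw² with λ = noise_var / prior_var.

```python
def neg_log_posterior(w, xs, ys, noise_var, prior_var):
    """Negative log posterior of a scalar weight w, up to an additive constant."""
    data_term = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_var)
    prior_term = w ** 2 / (2 * prior_var)
    return data_term + prior_term


def ridge_weight(xs, ys, lam):
    """Closed-form minimiser of sum((y - w*x)^2) + lam * w^2 for scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data and variances: illustrative assumptions only.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
noise_var = 1.0
prior_var = 0.5
lam = noise_var / prior_var  # tighter prior (smaller prior_var) means larger lam

w_mle = ridge_weight(xs, ys, lam=0.0)  # no penalty: plain least squares
w_map = ridge_weight(xs, ys, lam=lam)  # ridge solution

# The ridge solution should also minimise the negative log posterior:
# nudging w in either direction must not lower it.
at_map = neg_log_posterior(w_map, xs, ys, noise_var, prior_var)
is_minimum = all(
    neg_log_posterior(w_map + delta, xs, ys, noise_var, prior_var) >= at_map
    for delta in (-1e-3, 1e-3)
)

print(w_mle, w_map)  # roughly 2.04 and 1.78: the prior shrinks the weight toward 0
print(is_minimum)    # True
```

Shrinking `prior_var` raises `lam` and pulls `w_map` further toward zero, which is the "tight prior" reading of a large regularisation strength.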
In the MAP objective, the log-likelihood is a sum of n terms and the log-prior is one term. As n grows, the sum grows linearly and the prior stays put, so its relative weight decays like 1/n:
(1/n) [ log p(D | θ) + log p(θ) ] = (1/n) Σᵢ log p(xᵢ | θ) + (1/n) log p(θ)
| Regime | What dominates | Practical reading |
|---|---|---|
| Small n | Prior | Regularisation matters enormously; MLE is dangerous. |
| Large n | Likelihood | MLE and MAP nearly coincide; argue about priors less. |
| Any n, full posterior wanted | Neither point estimate | MAP keeps one point and discards uncertainty — Bayesian inference proper keeps the whole distribution. |
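The small-n and large-n rows can be illustrated with the coin example. The sketch below assumes the Beta(2,2) prior from earlier and hypothetical datasets in which exactly three quarters of the flips are heads; it tracks the gap between the MLE and the MAP estimate as n grows.

```python
def map_beta(heads, n, a=2.0, b=2.0):
    """Posterior mode for a Beta(a, b) prior after `heads` heads in `n` flips."""
    return (heads + a - 1) / (n + a + b - 2)


gaps = []
for n in (4, 40, 400, 4000):
    heads = 3 * n // 4           # hypothetical data: 75% heads at every n
    mle = heads / n              # always 0.75 here
    gap = abs(mle - map_beta(heads, n))
    gaps.append((n, gap))
    print(n, round(gap, 6))
```

For this setup the gap works out to 0.5 / (n + 2), so it falls by roughly a factor of ten each time n grows tenfold, matching the 1/n decay of the prior's relative weight.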