General ML

Loss Functions

Choose the noise model; the loss follows

01 · First principles: What a loss actually is

A model cannot be trained against "be correct"; it needs a number that says how much each kind of wrong costs. The loss is that price list. And because training is gradient descent, the loss's influence is entirely through its gradient: ∂L/∂(prediction) is the teaching signal that backprop carries into the network. Two losses that rank errors identically but shape their gradients differently train differently. So the real design questions are: what does this loss charge for large errors, and what gradient does it emit?

02 · The hidden assumption: Every loss is a noise model

The principled route to a loss is maximum likelihood: assume how observations scatter around the model's prediction, then minimise negative log-likelihood. Each scatter assumption produces a familiar loss.

y = f(x) + ε,   ε ~ Gaussian  ⇒  −log p(y|x) = (y − f(x))²/(2σ²) + const  ⇒  MSE
ε ~ Laplace  ⇒  −log p(y|x) = |y − f(x)|/b + const  ⇒  MAE
y ~ Categorical(softmax(z))  ⇒  −log p(y|x) = −log p_y  ⇒  cross-entropy
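The "+ const" in these lines can be checked numerically. The sketch below assumes unit scale parameters (σ = 1, b = 1) and uses illustrative function names that do not come from the text; it subtracts the plain loss from the full negative log-density and confirms that the gap does not depend on the prediction.

```python
import math

def gaussian_nll(y, pred, sigma=1.0):
    """Negative log-density of y under a Gaussian centred on pred."""
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - pred) ** 2 / (2 * sigma**2)

def laplace_nll(y, pred, b=1.0):
    """Negative log-density of y under a Laplace distribution centred on pred."""
    return math.log(2 * b) + abs(y - pred) / b

y = 3.0                          # one observed target (arbitrary value)
preds = [0.0, 1.5, 3.0, 7.0]     # candidate predictions f(x)

# NLL minus the bare loss: if this is the same for every prediction,
# minimising the NLL and minimising the loss are the same problem.
gauss_gaps = [gaussian_nll(y, p) - (y - p) ** 2 / 2 for p in preds]
laplace_gaps = [laplace_nll(y, p) - abs(y - p) for p in preds]

print(gauss_gaps)    # every entry is 0.5 * log(2*pi), up to rounding
print(laplace_gaps)  # every entry is log(2), up to rounding
```

The constant depends on σ or b but not on f(x), which is why the scale can be dropped when only the location of the minimum matters.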

Read the consequences off the noise. Gaussian tails are thin, so MSE charges quadratically: a single outlier at error 10 costs as much as a hundred points at error 1, and the fit chases it. Laplace tails are heavier, so MAE charges linearly and shrugs at outliers (its minimiser is the conditional median, where MSE's is the conditional mean). Choosing a loss is choosing what you believe about the noise, whether or not you do it consciously.
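The mean-versus-median claim can be seen on a toy dataset. This is a minimal sketch with made-up numbers: it fits a single constant prediction by brute-force grid search, once under total squared error and once under total absolute error, after one target has been replaced by an outlier.

```python
import statistics

def total_squared(c, ys):
    """Sum of squared errors for the constant prediction c."""
    return sum((y - c) ** 2 for y in ys)

def total_absolute(c, ys):
    """Sum of absolute errors for the constant prediction c."""
    return sum(abs(y - c) for y in ys)

# Hypothetical targets: four ordinary points and one outlier.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]

# Candidate constant predictions 0.0, 0.1, ..., 100.0.
grid = [i / 10 for i in range(0, 1001)]

best_mse = min(grid, key=lambda c: total_squared(c, targets))
best_mae = min(grid, key=lambda c: total_absolute(c, targets))

print(best_mse, statistics.mean(targets))    # 22.0 22.0 : dragged toward the outlier
print(best_mae, statistics.median(targets))  # 3.0 3.0  : stays with the bulk of the data
```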

Huber is the diplomatic middle: quadratic inside a band δ (smooth, mean-like near the answer), linear outside (robust to outliers). One hyperparameter buys both behaviours.
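A minimal sketch of Huber in its common parameterisation (the ½ factor and the default δ = 1 are assumptions here; the text only specifies "quadratic inside, linear outside"). The gradient function makes the two regimes explicit: proportional to the error inside the band, clipped to ±δ outside it.

```python
def huber(e, delta=1.0):
    """Huber loss of a residual e: quadratic for |e| <= delta, linear beyond."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    # The linear branch is offset so value and slope match at |e| == delta.
    return delta * (a - 0.5 * delta)

def huber_grad(e, delta=1.0):
    """Derivative of the Huber loss with respect to the residual e."""
    if abs(e) <= delta:
        return e                         # MSE-like: shrinks near the answer
    return delta if e > 0 else -delta    # MAE-like: capped for outliers

for e in [0.1, 0.5, 1.0, 3.0, -10.0]:
    print(e, huber(e), huber_grad(e))
```

Shrinking δ moves the loss toward MAE behaviour; growing it moves the loss toward MSE.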

03 · Classification: Cross-entropy and the clean gradient

For classification the noise model is categorical: the model emits logits z, softmax turns them into probabilities p, and we charge the negative log of the probability assigned to the true class. The reason this pairing is canonical appears when we differentiate the composition. With one-hot target y:

L = −Σ_k y_k log p_k,   p = softmax(z)
∂L/∂z_k = p_k − y_k

The softmax Jacobian and the log's reciprocal cancel exactly, leaving the gradient "predicted probability minus truth" — large when confidently wrong, shrinking smoothly to zero as the prediction becomes correct. The teaching signal is proportional to the mistake at every confidence level.
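The "predicted probability minus truth" gradient can be checked against finite differences. The sketch below is pure Python with arbitrary example logits; the helper names are illustrative, not taken from any library.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Negative log-probability that softmax(z) assigns to the true class."""
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """Claimed gradient with respect to the logits: p_k - y_k."""
    p = softmax(z)
    return [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]

def numeric_grad(f, z, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + eps] + z[k + 1:]
        down = z[:k] + [z[k] - eps] + z[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]   # arbitrary logits
target = 1             # index of the true class

g_analytic = analytic_grad(z, target)
g_numeric = numeric_grad(lambda v: cross_entropy(v, target), z)
print(g_analytic)
print(g_numeric)       # agrees with the analytic gradient to about 1e-9
```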

Now try MSE on the same softmax outputs, L = Σ_k (p_k − y_k)². The chain rule no longer cancels; the gradient keeps a factor of the softmax derivative, which is ≈ 0 wherever the network is saturated and confident. A model that is confidently wrong emits almost no gradient, the worst possible place to stop teaching. Same predictions, same ranking of errors, broken signal. This is the cleanest illustration that the gradient, not the value, is the loss's real product. (Cross-entropy against the data distribution also equals the KL divergence from the data distribution to the model, KL(p_data ‖ p_model), plus the entropy of the data, which does not depend on the model, so minimising one minimises the other.)
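The saturation argument can be made concrete with a two-class example that is confidently wrong. In this sketch the logits are made up, and the MSE gradient is derived by hand from the softmax Jacobian ∂p_k/∂z_j = p_k(δ_kj − p_j); treat it as an illustration rather than a reference implementation.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(target, n):
    return [1.0 if k == target else 0.0 for k in range(n)]

def ce_grad(z, target):
    """Gradient of cross-entropy with respect to the logits: p - y."""
    p, y = softmax(z), one_hot(target, len(z))
    return [p_k - y_k for p_k, y_k in zip(p, y)]

def mse_grad(z, target):
    """Gradient of sum_k (p_k - y_k)^2 with respect to the logits.

    Chain rule through softmax gives
    dL/dz_j = 2 * p_j * ((p_j - y_j) - sum_k (p_k - y_k) * p_k),
    so every component carries a factor p_j that vanishes under saturation.
    """
    p, y = softmax(z), one_hot(target, len(z))
    r = [p_k - y_k for p_k, y_k in zip(p, y)]
    inner = sum(r_k * p_k for r_k, p_k in zip(r, p))
    return [2 * p_j * (r_j - inner) for p_j, r_j in zip(p, r)]

z = [10.0, -10.0]   # the model is nearly certain of class 0
target = 1          # but the true class is 1

g_ce = ce_grad(z, target)
g_mse = mse_grad(z, target)
print(g_ce)    # magnitudes close to 1: a strong correction
print(g_mse)   # magnitudes many orders smaller: almost no signal
```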

04 · The margin family: Hinge

Not every loss is a likelihood. Hinge loss, max(0, 1 − y·s) for label y ∈ {−1,+1} and score s, comes from a geometric demand instead: be correct by a margin. Points classified beyond the margin contribute exactly zero loss and zero gradient, so the model stops spending capacity on examples it has already handled, and the solution depends only on the borderline points (the support vectors of the SVM). Compare cross-entropy, which keeps charging (ever less) for every example forever. Hinge is convex in the score (a maximum of affine pieces; see convex functions), but because it is indifferent past the margin, its scores make poor probability estimates; it answers "which side?", not "how likely?".
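The contrast between hinge and cross-entropy past the margin can be sketched directly. For the comparison this example assumes the binary form of cross-entropy on a ±1 label, the logistic loss log(1 + exp(−y·s)); the scores are made up, and the subgradient at the kink is chosen as 0 by convention.

```python
import math

def hinge(y, s):
    """Hinge loss for label y in {-1, +1} and score s."""
    return max(0.0, 1.0 - y * s)

def hinge_subgrad(y, s):
    """A subgradient of the hinge loss with respect to s (0 chosen at the kink)."""
    return -y if y * s < 1.0 else 0.0

def logistic_loss(y, s):
    """Binary cross-entropy on a +/-1 label, written as a function of the score."""
    return math.log1p(math.exp(-y * s))

def logistic_grad(y, s):
    """Derivative of the logistic loss with respect to s."""
    return -y / (1.0 + math.exp(y * s))

y = 1.0
for s in [-2.0, 0.5, 1.0, 2.0, 5.0]:
    print(s, hinge(y, s), hinge_subgrad(y, s),
          round(logistic_loss(y, s), 4), round(logistic_grad(y, s), 4))
```

For scores at or beyond the margin the hinge columns are exactly zero, while the logistic columns stay small but nonzero.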

05 · Side by side: The price lists

| Loss | Noise / principle | Gradient behaviour | Use when |
| --- | --- | --- | --- |
| MSE | Gaussian MLE; minimiser = mean | ∝ error; outliers dominate | clean regression targets |
| MAE | Laplace MLE; minimiser = median | constant ±1; no shrink near 0 | heavy-tailed targets, outliers |
| Huber | Gaussian centre, Laplace tails | ∝ error near 0, capped far out | regression with some outliers |
| Cross-entropy | categorical MLE | p − y through softmax; never saturates when wrong | classification, always the default |
| Hinge | margin geometry, not probability | zero past margin; sparse | SVMs, margin-critical problems |
Mental Model