Loss Functions

Choose the noise model; the loss follows

01 · First principles: What a loss actually is

A model cannot be trained against "be correct"; it needs a number that says how much each kind of wrong costs. The loss is that price list. And because training is gradient descent, the loss's influence is entirely through its gradient: ∂L/∂(prediction) is the teaching signal that backprop carries into the network. Two losses that rank errors identically but shape their gradients differently train differently. So the real design questions are: what does this loss charge for large errors, and what gradient does it emit?

02 · The hidden assumption: Every loss is a noise model

The principled route to a loss is maximum likelihood: assume how observations scatter around the model's prediction, then minimise negative log-likelihood. Each scatter assumption produces a familiar loss.

y = f(x) + ε, ε ~ Gaussian ⇒ −log p(y|x) = (y − f(x))²/(2σ²) + const ⇒ MSE
ε ~ Laplace ⇒ |y − f(x)|/b + const ⇒ MAE
y ~ Categorical(softmax(z)) ⇒ −log p_y ⇒ cross-entropy
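The two regression lines above can be checked numerically. The sketch below is a minimal illustration using only the standard library; the function names, the scale values `sigma` and `b`, and the residuals are made up for the example. It subtracts the scaled squared or absolute error from the full negative log-density and shows that what remains is the same constant for every residual.

```python
import math

def gaussian_nll(y, pred, sigma=1.0):
    """Negative log-density of y under Normal(pred, sigma^2)."""
    return 0.5 * ((y - pred) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))

def laplace_nll(y, pred, b=1.0):
    """Negative log-density of y under Laplace(pred, b)."""
    return abs(y - pred) / b + math.log(2 * b)

# Hypothetical scales and residuals, chosen only for illustration.
sigma, b = 1.5, 0.7
residuals = [0.0, 0.5, -2.0, 10.0]

# What is left after removing the error term does not depend on the residual,
# so minimising the NLL over predictions is minimising squared / absolute error.
gauss_const = [gaussian_nll(r, 0.0, sigma) - r ** 2 / (2 * sigma ** 2) for r in residuals]
laplace_const = [laplace_nll(r, 0.0, b) - abs(r) / b for r in residuals]

print(gauss_const)
print(laplace_const)
```

The leftover constants are log(σ√(2π)) and log(2b). They depend on the assumed noise scale, not on the prediction, which is why they drop out of training when the scale is held fixed.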

Read the consequences off the noise. Gaussian tails are thin, so large residuals are treated as wildly improbable and MSE charges for them quadratically: a single outlier at error 10 costs as much as a hundred points at error 1, and the fit chases it. Laplace tails are heavier, so MAE charges linearly and shrugs at outliers (its minimiser is the conditional median, where MSE's is the conditional mean). Choosing a loss is choosing what you believe about the noise, whether or not you do it consciously.
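The mean-versus-median claim is easy to see on a toy dataset. The following sketch assumes a made-up list of targets with one outlier and searches a grid of constant predictions by brute force; the helper names and the grid resolution are illustrative choices, not part of any library.

```python
import statistics

def mse(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def mae(c, ys):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up targets; 100.0 is the outlier

# Brute-force search over constant predictions 0.00, 0.01, ..., 100.00.
grid = [i / 100 for i in range(0, 10001)]
best_mse = min(grid, key=lambda c: mse(c, ys))
best_mae = min(grid, key=lambda c: mae(c, ys))

print("MSE minimiser:", best_mse, "mean:", statistics.mean(ys))
print("MAE minimiser:", best_mae, "median:", statistics.median(ys))
```

The MSE minimiser lands on the mean, which the outlier has dragged far from the bulk of the data, while the MAE minimiser stays at the median among the ordinary points.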

Huber is the diplomatic middle: quadratic inside a band δ (smooth, mean-like near the answer), linear outside (robust to outliers). One hyperparameter buys both behaviours.
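A minimal sketch of Huber, written on the residual r = y − f(x) and using the common convention with a factor of ½ on the quadratic piece (an assumption; the text above does not fix the scaling). The function names are hypothetical.

```python
def huber(r, delta=1.0):
    """Huber loss of residual r: quadratic for |r| <= delta, linear beyond."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)

def huber_grad(r, delta=1.0):
    """Derivative with respect to r: equal to r inside the band, clipped to +/-delta outside."""
    return max(-delta, min(delta, r))

for r in [0.1, 0.5, 1.0, 3.0, 10.0]:
    print(r, huber(r), huber_grad(r))
```

The two pieces meet with matching value and slope at |r| = δ, so the gradient behaves like MSE's near the answer and like a scaled MAE's far from it.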

03 · Classification: Cross-entropy and the clean gradient

For classification the noise model is categorical: the model emits logits z, softmax turns them into probabilities p, and we charge the negative log of the probability assigned to the true class. The reason this pairing is canonical appears when we differentiate the composition. With one-hot target y:

L = −Σ_k y_k log p_k, p = softmax(z)
∂L/∂z_k = p_k − y_k

The softmax Jacobian and the log's reciprocal cancel exactly, leaving the gradient "predicted probability minus truth" — large when confidently wrong, shrinking smoothly to zero as the prediction becomes correct. The teaching signal is proportional to the mistake at every confidence level.
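The cancellation can be verified without trusting the algebra. This sketch compares the analytic gradient p − y against central finite differences of the loss; the logits, the target index, and all helper names are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """-log softmax(z)[target] for an integer class index."""
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """The claimed gradient: predicted probability minus one-hot truth."""
    p = softmax(z)
    return [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]

def numeric_grad(f, z, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + eps] + z[k + 1:]
        down = z[:k] + [z[k] - eps] + z[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]  # made-up logits
target = 1            # made-up true class
g_analytic = analytic_grad(z, target)
g_numeric = numeric_grad(lambda v: cross_entropy(v, target), z)

print(g_analytic)
print(g_numeric)
```

The two gradients agree to within finite-difference error, and the entry for the true class is negative: the update raises that logit and lowers the others in proportion to the probability they wrongly hold.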

Now try MSE on the same softmax outputs, L = Σ_k (p_k − y_k)². The chain rule no longer cancels; the gradient keeps a factor of the softmax derivative, ∂p_k/∂z_j = p_k(δ_kj − p_j), which is ≈ 0 wherever the network is saturated and confident. A model that is confidently wrong emits almost no gradient, the worst possible place to stop teaching. Same predictions, same ranking of errors, broken signal. This is the cleanest illustration that the gradient, not the value, is the loss's real product. (Cross-entropy against the data distribution also equals the KL divergence KL(data ‖ model) plus the entropy of the data, which does not depend on the model, so minimising one minimises the other.)
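To put numbers on the saturation argument, the sketch below evaluates both gradients at a confidently wrong prediction. The logits and helper names are made up; the MSE gradient is obtained by pushing 2(p − y) through the softmax Jacobian p_k(δ_kj − p_j).

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_diff(z, target):
    """p - y for a one-hot target at index `target`."""
    return [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(softmax(z))]

def ce_grad(z, target):
    """Gradient of cross-entropy with respect to the logits: p - y."""
    return one_hot_diff(z, target)

def mse_loss(z, target):
    """Sum of squared differences between softmax outputs and the one-hot target."""
    return sum(d * d for d in one_hot_diff(z, target))

def mse_grad(z, target):
    """Gradient of mse_loss with respect to the logits, via the softmax Jacobian."""
    p = softmax(z)
    d = one_hot_diff(z, target)
    inner = sum(d_k * p_k for d_k, p_k in zip(d, p))
    return [2.0 * p_j * (d_j - inner) for p_j, d_j in zip(p, d)]

# Confidently wrong: almost all probability on class 0, but the truth is class 1.
z = [12.0, 0.0, 0.0]
target = 1
g_ce = ce_grad(z, target)
g_mse = mse_grad(z, target)

print("largest |CE gradient|: ", max(abs(g) for g in g_ce))
print("largest |MSE gradient|:", max(abs(g) for g in g_mse))
```

With these logits the cross-entropy gradient is close to 1 in magnitude, while the MSE gradient is several orders of magnitude smaller, so the same mistake produces a far weaker update.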

04 · The margin family: Hinge

Not every loss is a likelihood. Hinge loss, max(0, 1 − y·s) for label y ∈ {−1,+1} and score s, comes from a geometric demand instead: be correct by a margin. Points classified beyond the margin contribute exactly zero loss and zero gradient — the model stops spending capacity on examples it has already handled, and the solution depends only on the borderline points (the support vectors of the SVM). Compare cross-entropy, which keeps charging (ever less) for every example forever. Hinge is convex (a maximum of affine pieces — see convex functions) but its indifference past the margin gives weaker probability estimates; it answers "which side?", not "how likely?".
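The contrast with cross-entropy can be sketched in a few lines. The code below assumes labels in {−1, +1} and a raw score s, as in the definition above, and uses the margin form of binary cross-entropy, log(1 + exp(−y·s)), for comparison; the function names and sample scores are illustrative.

```python
import math

def hinge(score, label):
    """Hinge loss max(0, 1 - y*s) for label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)

def hinge_subgrad(score, label):
    """A subgradient with respect to the score: -label inside the margin, 0 at or beyond it."""
    return -float(label) if label * score < 1.0 else 0.0

def logistic(score, label):
    """Binary cross-entropy in margin form, log(1 + exp(-y*s))."""
    return math.log1p(math.exp(-label * score))

def logistic_grad(score, label):
    """Derivative of the logistic loss with respect to the score."""
    return -label / (1.0 + math.exp(label * score))

# Made-up scores for a positive example, from badly wrong to comfortably right.
for s in [-2.0, 0.5, 1.0, 3.0, 8.0]:
    print(s, hinge(s, +1), hinge_subgrad(s, +1), logistic(s, +1), logistic_grad(s, +1))
```

Once y·s reaches 1, the hinge loss and its subgradient are exactly zero, whereas the logistic loss keeps a small positive value and a small nonzero gradient however large the score grows.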

05 · Side by side: The price lists

Loss	Noise / principle	Gradient behaviour	Use when
MSE	Gaussian MLE; minimiser = mean	∝ error; outliers dominate	clean regression targets
MAE	Laplace MLE; minimiser = median	constant ±1; no shrink near 0	heavy-tailed targets, outliers
Huber	Gaussian centre, Laplace tails	∝ error near 0, capped far out	regression with some outliers
Cross-entropy	categorical MLE	p − y through softmax; never saturates when wrong	classification; the usual default
Hinge	margin geometry, not probability	zero past margin; sparse	SVMs, margin-critical problems

Mental Model

A loss is a price list for being wrong; its gradient is the actual teaching signal.
Most standard losses are maximum likelihood under some noise: Gaussian → MSE, Laplace → MAE, categorical → cross-entropy. Hinge is the exception, derived from margins instead.
So choose the noise model first — the loss is then derived, not picked from a menu.
CE + softmax cancels to gradient p − y (loud when confidently wrong); MSE on softmax saturates exactly there.
Hinge trades probability for margins: zero gradient past the margin, the boundary defined by the hard cases.