Choose the noise model; the loss follows
A model cannot be trained against "be correct"; it needs a number that says how much each kind of wrong costs. The loss is that price list. And because training is gradient descent, the loss's influence is entirely through its gradient: ∂L/∂(prediction) is the teaching signal that backprop carries into the network. Two losses that rank errors identically but shape their gradients differently train differently. So the real design questions are: what does this loss charge for large errors, and what gradient does it emit?
The principled route to a loss is maximum likelihood: assume how observations scatter around the model's prediction, then minimise negative log-likelihood. Each scatter assumption produces a familiar loss.
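Written out for the two standard regression cases, the step from noise model to loss looks like this. The notation is an assumption made here for illustration: ŷ is the model's prediction, and σ and b are scale parameters treated as fixed.

```latex
\begin{align*}
p_{\text{Gauss}}(y \mid \hat{y}) &= \frac{1}{\sqrt{2\pi\sigma^{2}}}
    \exp\!\left(-\frac{(y-\hat{y})^{2}}{2\sigma^{2}}\right), &
-\log p_{\text{Gauss}}(y \mid \hat{y}) &= \frac{(y-\hat{y})^{2}}{2\sigma^{2}}
    + \tfrac{1}{2}\log\!\left(2\pi\sigma^{2}\right), \\[4pt]
p_{\text{Laplace}}(y \mid \hat{y}) &= \frac{1}{2b}
    \exp\!\left(-\frac{|y-\hat{y}|}{b}\right), &
-\log p_{\text{Laplace}}(y \mid \hat{y}) &= \frac{|y-\hat{y}|}{b} + \log(2b).
\end{align*}
```

With σ and b held fixed, the additive constants and positive scale factors do not move the minimiser, so summing over the data leaves the squared error (MSE) in the Gaussian case and the absolute error (MAE) in the Laplace case.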
Read the consequences off the noise. Gaussian tails are thin, so MSE charges quadratically: a single outlier at error 10 costs as much as a hundred points at error 1, and the fit chases it. Laplace tails are heavier, so MAE charges linearly and shrugs at outliers (its minimiser is the conditional median, where MSE's is the conditional mean). Choosing a loss is choosing what you believe about the noise, whether or not you do it consciously.
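The mean-versus-median contrast can be checked on a toy example. The sketch below fits a single constant to a made-up list of targets (the data and the helper names `mse` and `mae` are hypothetical, chosen only for illustration) and confirms by a coarse grid search that the mean minimises MSE while the median minimises MAE.

```python
import statistics

# Hypothetical 1-D targets: ten clean points near 1.0 plus one outlier at 11.0.
targets = [0.8, 0.9, 0.9, 1.0, 1.0, 1.0, 1.0, 1.1, 1.1, 1.2, 11.0]


def mse(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def mae(c, ys):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)


mean_fit = statistics.mean(targets)      # closed-form minimiser of MSE
median_fit = statistics.median(targets)  # closed-form minimiser of MAE

# Coarse grid search over constants 0.00, 0.01, ..., 12.00 as a sanity check.
grid = [i / 100 for i in range(0, 1201)]
best_mse = min(grid, key=lambda c: mse(c, targets))
best_mae = min(grid, key=lambda c: mae(c, targets))

print(f"mean   = {mean_fit:.3f}, grid minimiser of MSE = {best_mse:.2f}")
print(f"median = {median_fit:.3f}, grid minimiser of MAE = {best_mae:.2f}")
```

In this toy data the single outlier drags the MSE fit well away from the cluster of clean points, while the MAE fit stays inside it.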
For classification the noise model is categorical: the model emits logits z, softmax turns them into probabilities p, and we charge the negative log of the probability assigned to the true class. The reason this pairing is canonical appears when we differentiate the composition. With one-hot target y, the loss is L = −Σₖ yₖ log pₖ, and its gradient with respect to the logits is ∂L/∂zₖ = pₖ − yₖ.
The softmax Jacobian and the log's reciprocal cancel exactly, leaving the gradient "predicted probability minus truth" — large when confidently wrong, shrinking smoothly to zero as the prediction becomes correct. The teaching signal is proportional to the mistake at every confidence level.
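The cancellation can be traced step by step with the chain rule. This is the standard derivation; the index names i and k and the Kronecker delta δ are notation introduced here.

```latex
\begin{align*}
p_i &= \frac{e^{z_i}}{\sum_{j} e^{z_j}}, \qquad
L = -\sum_{i} y_i \log p_i, \qquad
\frac{\partial p_i}{\partial z_k} = p_i\,(\delta_{ik} - p_k), \\[4pt]
\frac{\partial L}{\partial z_k}
  &= -\sum_{i} \frac{y_i}{p_i}\,\frac{\partial p_i}{\partial z_k}
   = -\sum_{i} \frac{y_i}{p_i}\; p_i\,(\delta_{ik} - p_k)
   = -y_k + p_k \sum_{i} y_i
   = p_k - y_k .
\end{align*}
```

The last step uses the fact that a one-hot (or any normalised) target sums to 1. The factor 1/pᵢ from the log is cancelled by the pᵢ in the softmax Jacobian, which is why no saturating term survives.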
Now try MSE on the same softmax outputs, L = Σₖ (pₖ − yₖ)². The chain rule no longer cancels; the gradient keeps a factor of the softmax derivative, which is ≈ 0 wherever the network is saturated and confident. A model that is confidently wrong emits almost no gradient, which is the worst possible place to stop teaching. Same predictions, same ranking of errors, broken signal. This is the cleanest illustration that the gradient, not the value, is the loss's real product. (Cross-entropy against the data distribution also equals, up to a constant that does not depend on the model, the KL divergence KL(data ‖ model), so minimising one minimises the other.)
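A small numerical comparison makes the difference concrete. The sketch below uses only the standard library; the logits, the target, and the function names are hypothetical choices for illustration. It evaluates both gradients with respect to the logits at a point where the model is confidently wrong.

```python
import math


def softmax(z):
    """Softmax of a list of logits, shifted by the max for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def grad_cross_entropy(z, y):
    """dL/dz for L = -sum_k y_k log p_k with p = softmax(z): simply p - y."""
    p = softmax(z)
    return [pk - yk for pk, yk in zip(p, y)]


def grad_mse_on_softmax(z, y):
    """dL/dz for L = sum_k (p_k - y_k)^2 with p = softmax(z).

    Chain rule: dL/dz_j = sum_k 2 (p_k - y_k) * p_k (delta_kj - p_j)
                        = p_j * (g_j - sum_k g_k p_k),  where g_k = 2 (p_k - y_k).
    """
    p = softmax(z)
    g = [2.0 * (pk - yk) for pk, yk in zip(p, y)]
    dot = sum(gk * pk for gk, pk in zip(g, p))
    return [pj * (gj - dot) for pj, gj in zip(p, g)]


# Confidently wrong: the model puts almost all mass on class 0, the truth is class 1.
z_wrong = [10.0, 0.0, 0.0]
y_true = [0.0, 1.0, 0.0]

ce_grad = grad_cross_entropy(z_wrong, y_true)
mse_grad = grad_mse_on_softmax(z_wrong, y_true)

print("cross-entropy gradient:", [f"{g:+.5f}" for g in ce_grad])
print("MSE-on-softmax gradient:", [f"{g:+.5f}" for g in mse_grad])
```

At this point the cross-entropy gradient has components close to ±1, while every component of the MSE gradient is several orders of magnitude smaller, because each one carries a factor pⱼ or (1 − pⱼ) that is nearly zero under saturation.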
Not every loss is a likelihood. Hinge loss, max(0, 1 − y·s) for label y ∈ {−1,+1} and score s, comes from a geometric demand instead: be correct by a margin. Points classified correctly with margin to spare (y·s ≥ 1) contribute exactly zero loss and zero gradient, so the model stops spending capacity on examples it has already handled, and the solution depends only on the borderline points (the support vectors of the SVM). Compare cross-entropy, which keeps charging (ever less) for every example forever. Hinge is convex in s (a maximum of affine pieces; see convex functions), but its indifference past the margin gives weaker probability estimates; it answers "which side?", not "how likely?".
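The contrast between hinge and cross-entropy past the margin can be sketched directly. For comparison the code uses the logistic loss log(1 + exp(−y·s)), which is binary cross-entropy written in the same ±1 label convention; that choice, the function names, and the sample scores are assumptions made for this illustration.

```python
import math


def hinge_loss(y, s):
    """Hinge loss max(0, 1 - y*s) for label y in {-1, +1} and score s."""
    return max(0.0, 1.0 - y * s)


def hinge_subgradient(y, s):
    """A subgradient of the hinge loss with respect to s.

    At the kink y*s == 1 any value between -y and 0 is valid; 0 is chosen here.
    """
    return -y if y * s < 1.0 else 0.0


def logistic_loss(y, s):
    """Binary cross-entropy in the +-1 convention: log(1 + exp(-y*s))."""
    return math.log1p(math.exp(-y * s))


def logistic_gradient(y, s):
    """Derivative of the logistic loss with respect to s: -y / (1 + exp(y*s))."""
    return -y / (1.0 + math.exp(y * s))


# Positive example (y = +1) at scores inside and beyond the margin.
y = 1
for s in [-1.0, 0.5, 1.0, 3.0, 6.0]:
    print(
        f"s={s:+.1f}  hinge={hinge_loss(y, s):.4f} (grad {hinge_subgradient(y, s):+.1f})"
        f"  logistic={logistic_loss(y, s):.4f} (grad {logistic_gradient(y, s):+.4f})"
    )
```

For scores at or beyond the margin the hinge columns are exactly zero, while the logistic loss and its gradient shrink but never reach zero.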
| Loss | Noise / principle | Gradient behaviour | Use when |
|---|---|---|---|
| MSE | Gaussian MLE; minimiser = mean | ∝ error; outliers dominate | clean regression targets |
| MAE | Laplace MLE; minimiser = median | constant ±1; no shrink near 0 | heavy-tailed targets, outliers |
| Huber | Gaussian centre, Laplace tails | ∝ error near 0, capped far out | regression with some outliers |
| Cross-entropy | categorical MLE | p − y through softmax; never saturates when wrong | classification; the usual default |
| Hinge | margin geometry, not probability | zero past margin; sparse | SVMs, margin-critical problems |
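The "Gradient behaviour" column for the three regression losses can be reproduced in a few lines. This is a minimal sketch under stated assumptions: r is the residual (prediction minus target), MSE is written as ½r² so its gradient is exactly r, and the Huber threshold `delta=1.0` is an arbitrary illustrative value.

```python
def mse_gradient(r):
    """Gradient of 0.5 * r**2 with respect to r: proportional to the error."""
    return r


def mae_gradient(r):
    """(Sub)gradient of |r| with respect to r: constant magnitude, sign of the error."""
    if r > 0:
        return 1.0
    if r < 0:
        return -1.0
    return 0.0  # any value in [-1, 1] is a valid subgradient at r == 0


def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


def huber_gradient(r, delta=1.0):
    """Gradient of the Huber loss: equal to r near zero, capped at +-delta far out."""
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


for r in [-10.0, -0.3, 0.0, 0.3, 10.0]:
    print(
        f"r={r:+6.1f}  MSE grad={mse_gradient(r):+6.1f}"
        f"  MAE grad={mae_gradient(r):+4.1f}  Huber grad={huber_gradient(r):+4.1f}"
    )
```

Near zero the Huber gradient tracks the MSE gradient; for large residuals it matches the MAE gradient scaled by delta, which is the "capped far out" behaviour in the table.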