The standard moves against overflow and cancellation
Nearly every numerical bug in ML code is one of two pathologies wearing a costume (the underlying machinery is in floating point):
The tricks below are not approximations. Each computes the same mathematical quantity along a different route — one that avoids visiting any intermediate value the format cannot hold. The cost is a few extra operations; the alternative is silent garbage.
Softmax: subtract the max
# naive: exp(z) overflows for z ≳ 89 in fp32
p = exp(z) / exp(z).sum()
# stable: shift by max(z) — identical result, exponents now ≤ 0
z = z - z.max()
p = exp(z) / exp(z).sum()
Softmax is shift-invariant — the e−max factors cancel in the ratio — so this is exact. After the shift the largest exponent is 0 and every term lies in (0, 1]; the only values lost are ones that underflow to 0, and those contributed nothing anyway. Every framework softmax does this internally; you re-meet the problem the day you write a custom kernel or a fused attention score.
Log-sum-exp: same shift, log domain
This is how cross-entropy is computed without ever materialising the probabilities — which is also why log_softmax + NLL (or logits straight into cross_entropy) is correct and log(softmax(z)) is the classic bug: the inner softmax underflows small probabilities to exactly 0, and log(0) = −inf.
Log-space probabilities
# naive: a product of 500 probabilities ~0.1 underflows fp64 around step 308
p_seq = prod(p_t)
# stable: sum the logs; compare, rank, and accumulate in log space
logp_seq = sum(log(p_t)) # combine alternatives with logsumexp, never exp-then-add
Long products of probabilities (sequence likelihoods, HMMs, beam search scores) leave representable range fast; their logs stay small and well-behaved. The rule generalises: stay in log space until the final moment a probability is genuinely needed.
Variance: never E[x²] − E[x]²
# naive: two big near-equal numbers subtracted — can even go negative
var = (x**2).mean() - x.mean()**2
# Welford: running mean, accumulate squared deviations from it
mean += (x_i - mean) / n
M2 += (x_i - mean) * (x_i - mean_prev) # var = M2 / n
For data with mean 1000 and standard deviation 1, the naive formula subtracts two numbers near 106 that agree in their first six digits — in fp32 that is all of them, and the "variance" comes out as rounding noise, sometimes negative. Welford's update sums small squared deviations directly; nothing large is ever cancelled. This is why normalisation statistics are computed in fp32 even in mixed-precision training, and in Welford form inside good layernorm kernels.
Kahan summation: carry the lost digits
s, c = 0, 0
for x in xs:
y = x - c # re-inject what the last add dropped
t = s + y
c = (t - s) - y # algebraically 0; in floats, the rounding error
s = t
Adding a small number to a large running sum drops the small number's low bits. Kahan computes exactly what was dropped and feeds it back, cutting error from O(n·eps) to O(eps) for roughly 4× the additions (often hidden under memory latency). The mainstream alternatives are simpler: accumulate in fp32/fp64, or sum pairwise as tree reductions on GPUs already do.
The eps in Adam's denominator looks like boilerplate, but its placement changes the optimizer:
When v̂ is tiny (flat directions, early steps), the two differ enormously: with ε = 10−8, inside gives a denominator of 10−4, outside gives 10−8 — a 10,000× difference in effective step size in exactly the directions where v carries no information. The published Adam uses outside; some frameworks and papers (and TensorFlow's epsilon at one point) differ, which is one reason "the same optimizer, the same hyperparameters" can fail to reproduce across frameworks. Treat eps as a real hyperparameter — large models often run it at 10−5 or even larger in bf16, precisely so the denominator stays out of the noise floor.
| Symptom | Suspect | Move |
|---|---|---|
| nan appears in attention or logits at long context / large scale | exp overflow | Max-subtraction; compute scores in fp32 |
| −inf loss on confident examples | log(softmax) | log_softmax / logsumexp on logits |
| Sequence scores all collapse to 0 | probability underflow | Log-space accumulation |
| Negative or jittery variance in a custom norm | cancellation | Welford, fp32 statistics |
| Metrics drift with reduction order / batch size | accumulated rounding | fp32 or Kahan/pairwise accumulation |
| Optimizer behaves differently across frameworks | eps placement or size | Check inside-vs-outside sqrt and the value of ε |