Applied ML

Numerical Precision Tricks

The standard moves against overflow and cancellation

01 · First principlesTwo enemies, many disguises

Nearly every numerical bug in ML code is one of two pathologies wearing a costume (the underlying machinery is in floating point):

Overflow in exp
ex exceeds fp32's range at x ≈ 89, fp16's at x ≈ 11. Attention logits and classifier logits reach those values routinely. Result: inf, then inf−inf = nan, then a dead run.
Catastrophic cancellation
Subtracting two nearly equal numbers deletes their shared leading digits; the relative error of the inputs is amplified by roughly 1/(1−a/b). The true digits are gone before the subtraction — rounding destroyed them on the way in.

The tricks below are not approximations. Each computes the same mathematical quantity along a different route — one that avoids visiting any intermediate value the format cannot hold. The cost is a few extra operations; the alternative is silent garbage.

02 · Against overflowMax-subtraction and log-sum-exp

Softmax: subtract the max

# naive: exp(z) overflows for z ≳ 89 in fp32
p = exp(z) / exp(z).sum()
# stable: shift by max(z) — identical result, exponents now ≤ 0
z = z - z.max()
p = exp(z) / exp(z).sum()

Softmax is shift-invariant — the e−max factors cancel in the ratio — so this is exact. After the shift the largest exponent is 0 and every term lies in (0, 1]; the only values lost are ones that underflow to 0, and those contributed nothing anyway. Every framework softmax does this internally; you re-meet the problem the day you write a custom kernel or a fused attention score.

Log-sum-exp: same shift, log domain

log Σ ezi  =  m + log Σ ezi − m,    m = maxi zi

This is how cross-entropy is computed without ever materialising the probabilities — which is also why log_softmax + NLL (or logits straight into cross_entropy) is correct and log(softmax(z)) is the classic bug: the inner softmax underflows small probabilities to exactly 0, and log(0) = −inf.

Log-space probabilities

# naive: a product of 500 probabilities ~0.1 underflows fp64 around step 308
p_seq = prod(p_t)
# stable: sum the logs; compare, rank, and accumulate in log space
logp_seq = sum(log(p_t))      # combine alternatives with logsumexp, never exp-then-add

Long products of probabilities (sequence likelihoods, HMMs, beam search scores) leave representable range fast; their logs stay small and well-behaved. The rule generalises: stay in log space until the final moment a probability is genuinely needed.

03 · Against cancellationWelford and Kahan

Variance: never E[x²] − E[x]²

# naive: two big near-equal numbers subtracted — can even go negative
var = (x**2).mean() - x.mean()**2
# Welford: running mean, accumulate squared deviations from it
mean += (x_i - mean) / n
M2   += (x_i - mean) * (x_i - mean_prev)   # var = M2 / n

For data with mean 1000 and standard deviation 1, the naive formula subtracts two numbers near 106 that agree in their first six digits — in fp32 that is all of them, and the "variance" comes out as rounding noise, sometimes negative. Welford's update sums small squared deviations directly; nothing large is ever cancelled. This is why normalisation statistics are computed in fp32 even in mixed-precision training, and in Welford form inside good layernorm kernels.

Kahan summation: carry the lost digits

s, c = 0, 0
for x in xs:
    y = x - c            # re-inject what the last add dropped
    t = s + y
    c = (t - s) - y      # algebraically 0; in floats, the rounding error
    s = t

Adding a small number to a large running sum drops the small number's low bits. Kahan computes exactly what was dropped and feeds it back, cutting error from O(n·eps) to O(eps) for roughly 4× the additions (often hidden under memory latency). The mainstream alternatives are simpler: accumulate in fp32/fp64, or sum pairwise as tree reductions on GPUs already do.

04 · The subtle oneWhere the eps goes

The eps in Adam's denominator looks like boilerplate, but its placement changes the optimizer:

inside:  m̂ / √(v̂ + ε)    vs    outside:  m̂ / (√v̂ + ε)

When v̂ is tiny (flat directions, early steps), the two differ enormously: with ε = 10−8, inside gives a denominator of 10−4, outside gives 10−8 — a 10,000× difference in effective step size in exactly the directions where v carries no information. The published Adam uses outside; some frameworks and papers (and TensorFlow's epsilon at one point) differ, which is one reason "the same optimizer, the same hyperparameters" can fail to reproduce across frameworks. Treat eps as a real hyperparameter — large models often run it at 10−5 or even larger in bf16, precisely so the denominator stays out of the noise floor.

The general rule behind every eps: it must be added before the operation that explodes (the division, the sqrt, the log), and it must be large relative to the rounding noise of the precision you compute in — an ε of 10−8 is invisible in bf16 arithmetic, where machine epsilon is ~4×10−3.

05 · Field guideSymptom → suspect

SymptomSuspectMove
nan appears in attention or logits at long context / large scaleexp overflowMax-subtraction; compute scores in fp32
−inf loss on confident exampleslog(softmax)log_softmax / logsumexp on logits
Sequence scores all collapse to 0probability underflowLog-space accumulation
Negative or jittery variance in a custom normcancellationWelford, fp32 statistics
Metrics drift with reduction order / batch sizeaccumulated roundingfp32 or Kahan/pairwise accumulation
Optimizer behaves differently across frameworkseps placement or sizeCheck inside-vs-outside sqrt and the value of ε
Mental Model