A fixed budget of bits against an infinite number line
There are uncountably many reals and 2^32 patterns in 32 bits, so any representation is a choice about which numbers to keep. Fixed-point keeps evenly spaced ones and dies at both ends of the scale. Floating point keeps numbers that are relatively evenly spaced (dense near zero, sparse far away), which fits computation, where what usually matters is relative error.
The format is base-2 scientific notation packed into bits: sign s, exponent e (stored with a bias), mantissa m (with an implicit leading 1 for normal numbers):
value = (−1)^s × 1.m × 2^(e − bias)
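The sign/exponent/mantissa packing can be inspected directly. The following is a minimal sketch for fp32 only, using Python's standard `struct` module; the helper names `decode_fp32` and `rebuild_normal` are made up for this illustration, and the rebuild step assumes a normal number (no subnormals, inf, or nan).

```python
import struct

def decode_fp32(x: float):
    """Split the fp32 encoding of x into (sign, stored exponent, mantissa field)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF        # 23 explicit bits, leading 1 is implicit
    return sign, exponent, mantissa

def rebuild_normal(sign: int, exponent: int, mantissa: int) -> float:
    """Evaluate (-1)^s * (1 + m / 2^23) * 2^(e - 127). Normal numbers only."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

s, e, m = decode_fp32(-6.5)      # -6.5 = -1.625 * 2^2
print(s, e, m)                   # → 1 129 5242880
print(rebuild_normal(s, e, m))   # → -6.5
```

The stored exponent 129 is the true exponent 2 plus the bias 127, and the mantissa field 5242880 is 0.625 × 2^23, the fractional part of 1.625.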
Every format below is just a different split of one budget. Exponent bits decide how far from zero you can go; mantissa bits decide how many significant digits you carry. You cannot have both in 16 bits, which is the entire fp16-vs-bf16 story in mixed precision.
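The range-versus-precision trade can be made concrete with two values: one that needs mantissa bits and one that needs exponent bits. This is a sketch under stated assumptions: `to_fp16` and `to_bf16` are hypothetical helpers written for this example, fp16 rounding comes from the standard `struct` module's `"e"` format, and bf16 is emulated by rounding the fp32 encoding to its top 16 bits (nan and overflow during that rounding are not handled).

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to fp16 (1/5/10). Out-of-range values become inf, as on hardware."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):  # struct refuses values beyond fp16 range
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Round x to bf16 (1/8/7): fp32 encoding, rounded to nearest-even at bit 16."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

fine = 1.0 + 2.0**-10   # needs 10 mantissa bits
large = 70000.0         # needs more than 5 exponent bits

print(to_fp16(fine), to_bf16(fine))     # → 1.0009765625 1.0
print(to_fp16(large), to_bf16(large))   # → inf 70144.0
```

fp16 keeps the small increment but cannot reach 70000; bf16 reaches it (coarsely, as 70144) but drops the increment.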
One budget, different splits. Note bf16 and tf32 share fp32's 8-bit exponent; fp16 and e5m2 share a 5-bit one.
| Format | Bits (s/e/m) | Max normal | Machine eps (≈ rel. error) | Role |
|---|---|---|---|---|
| fp32 | 1/8/23 | ~3.4×10^38 | ~1.2×10^−7 | Master weights, optimizer state, reductions |
| tf32 | 1/8/10 | ~3.4×10^38 | ~9.8×10^−4 | What "fp32 matmuls" silently become on Ampere+ |
| fp16 | 1/5/10 | 65504 | ~9.8×10^−4 | Inference; training with loss scaling |
| bf16 | 1/8/7 | ~3.4×10^38 | ~7.8×10^−3 | Default training compute type |
| fp8 e4m3 | 1/4/3 | 448 | ~1.3×10^−1 | FP8 weights/activations (per-tensor scaling required) |
| fp8 e5m2 | 1/5/2 | 57344 | ~2.5×10^−1 | FP8 gradients |
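
The range and precision figures follow from the bit split alone. The sketch below assumes IEEE-style conventions: bias 2^(e−1) − 1, the all-ones exponent reserved for inf/nan, and eps taken as the spacing just above 1.0, which is 2^−(mantissa bits). Some references quote half of that, the unit roundoff, instead. The function names are made up for this example. fp8 e4m3 is left out because it reclaims most of the top exponent code for finite values, which is how it reaches 448.

```python
def machine_eps(mantissa_bits: int) -> float:
    """Spacing between 1.0 and the next representable number."""
    return 2.0 ** -mantissa_bits

def max_normal_ieee(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value when the all-ones exponent is reserved for inf/nan."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest usable unbiased exponent equals the bias; mantissa is all ones.
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias

# (exponent bits, mantissa bits) for the IEEE-style rows
formats = {
    "fp32": (8, 23),
    "tf32": (8, 10),
    "fp16": (5, 10),
    "bf16": (8, 7),
    "fp8 e5m2": (5, 2),
}

for name, (e, m) in formats.items():
    print(f"{name:9s} eps={machine_eps(m):.2g}  max={max_normal_ieee(e, m):.6g}")
```

Moving one bit from mantissa to exponent doubles eps and roughly squares the reachable range, which is the trade every row is making.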
Machine epsilon is the spacing between 1.0 and the next representable number, 2^−(mantissa bits); it bounds the relative error of a single rounding. Every individual float operation is exact-then-rounded: the IEEE guarantee is fl(a∘b) = (a∘b)(1+δ) with |δ| ≤ eps, as long as the result neither overflows nor underflows. One operation is fine. The trouble is composition.
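The single-operation bound can be checked exactly with rational arithmetic. This is a small sketch, not a proof: fp32 is emulated by round-tripping Python floats through `struct`, and `f32` and `relative_rounding_error` are helper names invented for this example. It relies on the product of two fp32 values being exact in fp64, so rounding it once gives the correctly rounded fp32 product.

```python
import struct
from fractions import Fraction

EPS_FP32 = Fraction(1, 2**23)

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def relative_rounding_error(a: float, b: float) -> Fraction:
    """Return delta such that fl(a*b) = (a*b)(1 + delta), for fp32 inputs."""
    a, b = f32(a), f32(b)
    exact = Fraction(a) * Fraction(b)   # exact rational product
    rounded = f32(a * b)                # a*b is exact in fp64, then rounded once
    return (Fraction(rounded) - exact) / exact

delta = relative_rounding_error(0.1, 0.3)
print(float(delta))              # tiny, sign depends on the inputs
print(abs(delta) <= EPS_FP32)    # → True
```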
Because every add rounds, (a + b) + c ≠ a + (b + c) in general. A two-line demonstration:
import numpy as np
big, one = np.float32(2**24), np.float32(1.0)  # fp32: above 2^24 the spacing is 2, so a lone +1 rounds away
(big + one) + one  # → 16777216.0 (each +1 lost separately)
big + (one + one)  # → 16777218.0 (the 2 survives together)
Now scale that up: a sum over a million gradient elements has a different value for every ordering, and a GPU reduction's ordering depends on how the kernel split the work — which depends on block scheduling, which is not deterministic. This is why the same training script on the same data can produce bitwise-different losses run to run, why torch.use_deterministic_algorithms(True) costs speed (it forces fixed reduction orders), and why an allreduce across a different number of ranks gives bitwise-different gradients. Not a bug; the arithmetic itself is order-sensitive. (When the differences are large, you have a conditioning problem — see precision tricks.)
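The effect of reduction order can be reproduced on a CPU without any GPU. The sketch below is an emulation, not a real kernel: fp32 addition is imitated by rounding each partial sum through `struct`, the chunk size stands in for however a kernel or an allreduce splits the work, and `f32`, `sum_f32`, and `chunked_sum_f32` are names invented for this example.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_f32(values) -> float:
    """Left-to-right sum with every partial result rounded to fp32."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

def chunked_sum_f32(values, chunk: int) -> float:
    """Sum each chunk separately, then sum the partials in order."""
    partials = [sum_f32(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum_f32(partials)

# One large element followed by many small ones. Near 2^25 the fp32 spacing is 4.
data = [2.0**25] + [1.0] * 1023

print(sum(data))                    # → 33555455.0 (exact, computed in fp64)
print(sum_f32(data))                # → 33554432.0 (every +1 is rounded away)
print(sum_f32(data[::-1]))          # → 33555456.0 (small values accumulate first)
print(chunked_sum_f32(data, 512))   # → 33554944.0
print(chunked_sum_f32(data, 256))   # → 33555200.0
```

Same data, four orderings, four different fp32 answers, none of them the exact sum. The input is deliberately adversarial; with typical gradient values the orderings usually differ only in the last few bits.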
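nan compares unequal to everything, including itself, so x != x is a legitimate nan test. nan also propagates through every arithmetic operation, which is the reason one overflow anywhere (inf, then inf − inf or 0 × inf) eventually paints the whole loss nan.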
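Both behaviours show up in plain Python floats (fp64), which follow the same IEEE rules as the narrower formats. This is a minimal illustration; `is_nan` and the `losses` list are made up for the example.

```python
import math

def is_nan(x: float) -> bool:
    """nan is the only float value that compares unequal to itself."""
    return x != x

overflow = 1e308 * 10            # exceeds the fp64 range → inf
poisoned = overflow - overflow   # inf - inf has no meaningful value → nan

losses = [0.25, 0.5, poisoned, 0.125]
total = sum(losses)              # one nan term makes the whole sum nan

print(math.isinf(overflow))      # → True
print(is_nan(poisoned))          # → True
print(is_nan(total))             # → True
print(is_nan(1.0))               # → False
```

In practice `math.isnan` (or the framework's equivalent) reads more clearly than `x != x`, but the self-inequality test needs no imports and works in most languages.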