Applied ML

Floating Point Representation

A fixed budget of bits against an infinite number line

01 · First principles · Base-2 scientific notation

There are uncountably many reals and 2³² patterns in 32 bits, so any representation is a choice about which numbers to keep. Fixed-point keeps evenly spaced ones and dies at both ends of the scale. Floating point keeps numbers that are relatively evenly spaced — dense near zero, sparse far away — which fits computation, where what usually matters is relative error.

The format is base-2 scientific notation packed into bits: sign s, exponent e (stored with a bias), mantissa m (with an implicit leading 1 for normal numbers):

x = (−1)^s · 1.m · 2^{e − bias}

Exponent bits buy range; mantissa bits buy precision.

Every format below is just a different split of one budget. Exponent bits decide how far from zero you can go; mantissa bits decide how many significant digits you carry. You cannot have both in 16 bits, which is the entire fp16-vs-bf16 story in mixed precision.
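To make the sign/exponent/mantissa split concrete, here is a minimal sketch that pulls the three fields out of an fp32 value and rebuilds the number from the formula above. The helper names `decode_fp32` and `rebuild_normal` are illustrative, not from any library, and the rebuild step assumes a normal number (stored exponent strictly between 0 and 255).

```python
import struct

def decode_fp32(x: float) -> tuple[int, int, int]:
    """Round x to fp32 and split its 32 bits into (sign, stored exponent, mantissa field)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # reinterpret the 4 bytes as a uint32
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF         # 23 bits; the leading 1 is implicit, not stored
    return sign, exponent, mantissa

def rebuild_normal(sign: int, exponent: int, mantissa: int) -> float:
    """Evaluate (-1)^s * 1.m * 2^(e - bias). Assumes a normal number (0 < e < 255)."""
    return (-1.0) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

print(decode_fp32(1.0))                      # (0, 127, 0)
print(decode_fp32(-6.5))                     # (1, 129, 5242880), since 6.5 = 1.625 * 2^2
print(rebuild_normal(*decode_fp32(-6.5)))    # -6.5
```

The same decoding works for any format in this note by changing the field widths and the bias (2^(exponent bits − 1) − 1).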

02 · The zoo · Layouts that matter in ML

One budget, different splits. Note bf16 and tf32 share fp32's 8-bit exponent; fp16 and e5m2 share a 5-bit one.

Format	Bits (s/e/m)	Max normal	Machine eps (spacing at 1.0)	Role
fp32	1/8/23	~3.4×10³⁸	~1.2×10⁻⁷	Master weights, optimizer state, reductions
tf32	1/8/10	~3.4×10³⁸	~9.8×10⁻⁴	What "fp32" matmuls become on Ampere+ when TF32 is enabled
fp16	1/5/10	65504	~9.8×10⁻⁴	Inference; training with loss scaling
bf16	1/8/7	~3.4×10³⁸	~7.8×10⁻³	Default training compute type
fp8 e4m3	1/4/3	448	~1.3×10⁻¹	FP8 weights/activations (per-tensor scaling required)
fp8 e5m2	1/5/2	57344	~2.5×10⁻¹	FP8 gradients
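The range and precision columns follow directly from the bit split. As a rough sketch, taking machine epsilon to be 2^−m for m mantissa bits and assuming an IEEE-style layout in which the all-ones exponent is reserved for inf and nan, a hypothetical helper `format_stats` reproduces them. fp8 e4m3 is left out of the loop on purpose: it reclaims most of the top exponent for ordinary numbers, so its maximum is 448 rather than the 240 this formula would give.

```python
def format_stats(exp_bits: int, man_bits: int) -> tuple[float, float]:
    """Return (machine eps, max normal) for an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    eps = 2.0 ** -man_bits                     # spacing between 1.0 and the next value
    max_exp = (2 ** exp_bits - 2) - bias       # all-ones exponent is reserved for inf/nan
    max_normal = (2 - eps) * 2.0 ** max_exp    # mantissa all ones at the top usable exponent
    return eps, max_normal

formats = [("fp32", 8, 23), ("tf32", 8, 10), ("fp16", 5, 10), ("bf16", 8, 7), ("fp8 e5m2", 5, 2)]
for name, e, m in formats:
    eps, top = format_stats(e, m)
    print(f"{name:9s} eps={eps:.3g}  max={top:.6g}")
```

Moving one bit from the mantissa to the exponent doubles eps and roughly squares the maximum, which is the whole trade in one line.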

Machine epsilon is the spacing between 1.0 and the next representable number, 2^{−(mantissa bits)}; it bounds the relative error of a single rounding. Every basic float operation (+, −, ×, ÷, √) is exact-then-rounded: the IEEE guarantee is fl(a∘b) = (a∘b)(1+δ) with |δ| ≤ eps/2 under the default round-to-nearest, as long as the result neither overflows nor underflows. One operation is fine. The trouble is composition.
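A small experiment shows the gap between one rounding and many. This sketch emulates fp32 in plain Python by rounding every result through `struct`; the helper `f32` is an illustrative name, and the emulation relies on fp64 being wide enough that rounding an fp64 sum of two fp32 values gives the correctly rounded fp32 sum. It measures the error of a single rounding, then the drift after one million chained adds of 0.1.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

EPS = 2.0 ** -23  # fp32 machine epsilon

# One operation: the rounded result sits within eps/2 (relative) of the fp64 reference.
reference = 1.0 / 3.0
single_err = abs(f32(reference) - reference) / reference

# Composition: add fp32(0.1) one million times, rounding after every add.
step = f32(0.1)
acc = 0.0
for _ in range(1_000_000):
    acc = f32(acc + step)
expected = 1_000_000 * step
drift = abs(acc - expected) / expected

print(f"single rounding : {single_err:.2e}  (bound eps/2 = {EPS / 2:.2e})")
print(f"1e6 chained adds: {drift:.2e}")  # several orders of magnitude above eps
```

The drift is large because once the accumulator is big, each small addend is rounded to the accumulator's coarse grid, and the same rounding error repeats on every step instead of averaging out.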

03 · The consequence · Non-associativity

Because every add rounds, (a + b) + c ≠ a + (b + c) in general. A two-line demonstration:

import numpy as np
a, b = np.float32(1e8), np.float32(3.0)  # fp32: adjacent floats near 1e8 are 8 apart
(a + b) + b   # → 100000000  (each +3 is under half a step and rounds away)
a + (b + b)   # → 100000008  (6 is over half a step, so it rounds up to the next float)

Now scale that up: a sum over a million gradient elements can come out differently for different orderings, and a GPU reduction's ordering depends on how the kernel split the work, which can depend on block scheduling and atomic-add arrival order, neither of which is deterministic. This is why the same training script on the same data can produce bitwise-different losses run to run, why torch.use_deterministic_algorithms(True) costs speed (it restricts PyTorch to deterministic kernels with fixed reduction orders, and raises an error where none exists), and why an allreduce across a different number of ranks gives bitwise-different gradients. Not a bug; the arithmetic itself is order-sensitive. (When the differences are large, you have a conditioning problem; see precision tricks.)
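The dependence on how work is split can be reproduced without a GPU. In this sketch, `chunked_sum` is a hypothetical stand-in for a reduction divided among `n_chunks` workers: each contiguous chunk is summed in emulated fp32, then the partial sums are combined in fp32. The input is hand-picked (one large value followed by eight 3s) so that every rounding step is easy to follow; near 1e8, fp32 values are 8 apart.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def seq_sum(values):
    """Left-to-right sum with an fp32 rounding after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

def chunked_sum(values, n_chunks):
    """Sum contiguous chunks separately, then sum the partials (all in emulated fp32)."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [seq_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return seq_sum(partials)

grads = [1e8] + [3.0] * 8

print(chunked_sum(grads, 1))  # 100000000.0: every +3 is rounded away
print(chunked_sum(grads, 3))  # 100000016.0: the 3s pool into 9s before meeting 1e8
print(sum(grads))             # 100000024.0: fp64 reference
```

Same data, same arithmetic, two different fp32 answers, and neither matches the fp64 reference. Changing the number of chunks plays the role of changing the number of ranks in an allreduce.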

04 · The edges · Subnormals, and the values that are not numbers

Subnormals. Below the smallest normal number, the implicit leading 1 is dropped and the mantissa alone counts down toward zero. This buys "gradual underflow" — values shrink smoothly instead of cliff-dropping to 0 — at the cost of precision that degrades digit by digit. In fp16 the subnormal zone (6×10⁻⁸ to 6×10⁻⁵) is exactly where un-scaled gradients live, which is the quantitative core of the loss-scaling story. On some hardware subnormal arithmetic is also slow.
inf and nan. The all-ones exponent is reserved: with zero mantissa it is ±inf (overflow's destination), otherwise nan. nan propagates through nearly every arithmetic operation and compares unequal even to itself, which is why x != x is a legitimate nan test. An inf becomes nan as soon as it meets inf − inf, 0 · inf, or inf / inf, which is why one overflow anywhere eventually paints the whole loss nan.
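Signed zero, rounding. +0 and −0 both exist and compare equal. Default rounding is round-to-nearest, ties-to-even: when an exact result falls halfway between two representable values, it goes to the one whose last mantissa bit is 0. The integer analog is banker's rounding, where 0.5 rounds to 0 and 1.5 to 2; the rule is chosen to avoid systematic drift in long accumulations.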
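The three edge cases above can be poked at directly. This sketch emulates fp16 and fp32 by rounding through `struct`; the helpers `f16` and `f32`, the sample gradient value 1e-7, and the loss scale of 1024 are all assumptions chosen for illustration. Python's `struct` raises on fp16 overflow instead of returning inf, so the helper maps that case to ±inf by hand to mimic IEEE behavior.

```python
import math
import struct

def f16(x: float) -> float:
    """Round to fp16; map struct's OverflowError to +/-inf as IEEE arithmetic would."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def f32(x: float) -> float:
    """Round to fp32."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Subnormals: below 2^-14 (~6.1e-5) fp16 values sit on a fixed grid of step 2^-24 (~6e-8).
grad = 1e-7                              # hypothetical un-scaled gradient
direct = f16(grad)                       # snaps to the nearest subnormal grid point
scaled = f16(grad * 1024) / 1024         # scale into the normal range, unscale in fp64
direct_err = abs(direct - grad) / grad
scaled_err = abs(scaled - grad) / grad
print(f"direct cast : rel error {direct_err:.2f}")
print(f"loss-scaled : rel error {scaled_err:.1e}")
print(f16(1e-8))                         # 0.0: too small even for the subnormal grid

# inf and nan: overflow lands on inf, and inf - inf is nan.
overflow = f16(70000.0)                  # above fp16's max of 65504
nan = overflow - overflow
print(overflow, nan != nan)              # inf True

# Signed zero and ties-to-even.
print(0.0 == -0.0, math.copysign(1.0, -0.0))   # True -1.0
print(round(0.5), round(1.5), round(2.5))      # 0 2 2
print(f32(16777217.0), f32(16777219.0))        # 16777216.0 16777220.0
```

The last line shows ties-to-even inside the float format itself: above 2^24, fp32 can only hold even integers, so both odd inputs are exact ties, and one rounds down while the other rounds up.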

Why this note exists: every numerics topic nearby — mixed precision, stability tricks, nondeterminism, FP8 scaling — reduces to three facts: exponent bits are range, mantissa bits are relative precision, and every operation rounds. Hold those and the rest is derivable.

Mental Model

A float is (−1)^s · 1.m · 2^{e − bias}: exponent bits buy range, mantissa bits buy significant digits, one budget split differently per format.
Machine eps ≈ 2^{−(mantissa bits)} is the relative error of one rounding; errors compound through composition, not single ops.
Rounding makes addition non-associative, so reduction order changes results; parallel sums are nondeterministic whenever their order is not fixed.
Subnormals fade to zero gradually; fp16's subnormal zone sits exactly where raw gradients live.
bf16 = fp32's range with 16 fewer mantissa bits (7 instead of 23); fp16 = 3 more mantissa bits than bf16, but range only to 65504; fp8 splits both ways (e4m3 vs e5m2) because no 8-bit split serves weights and gradients at once.