
The 1/√d Attention Scaling Factor

One constant, chosen so softmax keeps its gradients

01 · First principles: Why does d appear in the formula at all?

Attention scores are dot products: s = q·k, with q and k of dimension d. The formula in the Transformer paper is softmax(QKᵀ/√d)V, and the √d looks like a fudge until you ask one statistical question: as d grows, what happens to the size of a typical dot product?

Assume the entries of q and k are roughly independent with mean 0 and variance 1 (which is what sensible initialisation and layer norm aim for). Then:

s = Σ_{i=1}^{d} q_i k_i   ⇒   E[s] = 0,   Var[s] = d
each term has variance 1; d independent terms add
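
Why each term has variance 1 can be spelled out in two lines. This is a sketch under the same assumption as above (entries independent, mean 0, variance 1), not a general result:

```latex
\begin{aligned}
\mathrm{Var}[q_i k_i]
  &= \mathbb{E}[q_i^2 k_i^2] - \bigl(\mathbb{E}[q_i k_i]\bigr)^2
   = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \bigl(\mathbb{E}[q_i]\,\mathbb{E}[k_i]\bigr)^2
   = 1 \cdot 1 - 0 = 1, \\
\mathrm{Var}[s]
  &= \sum_{i=1}^{d} \mathrm{Var}[q_i k_i] = d .
\end{aligned}
```

The second line uses the fact that the d terms are uncorrelated, so their variances add with no cross terms.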

The variance is the sum of d unit variances. Typical logits are therefore of size ±√d — about ±8 at d = 64, ±11 at d = 128. The logits grow with a hyperparameter that has nothing to do with how confident attention ought to be.
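
The Var[s] = d claim is easy to check numerically. The sketch below assumes standard-normal entries (one convenient choice that satisfies the mean-0, variance-1 assumption); the function name `dot_product_variance` and the trial count are illustrative, not taken from any library.

```python
import math
import random

def dot_product_variance(d, trials=2000, seed=0):
    """Estimate Var[q.k] when q and k have i.i.d. standard-normal entries."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # One dot product of two fresh random d-dimensional vectors.
        s = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d))
        samples.append(s)
    mean = sum(samples) / trials
    # Unbiased sample variance.
    return sum((s - mean) ** 2 for s in samples) / (trials - 1)

if __name__ == "__main__":
    for d in (16, 64, 128):
        var = dot_product_variance(d)
        print(f"d={d:4d}  Var[s]~{var:7.1f}  std~{math.sqrt(var):5.2f}  "
              f"sqrt(d)={math.sqrt(d):5.2f}")
```

The estimated standard deviation should track √d closely, up to sampling noise of a few percent at this trial count.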

02 · The failure: Big logits kill softmax gradients

Softmax with large-magnitude inputs saturates: the largest logit takes probability ≈ 1, everything else ≈ 0. The gradient of softmax involves terms p_i(1 − p_i) and −p_i p_j, and every one of them vanishes when each p is pinned near 0 or 1. A saturated attention head is effectively an argmax: frozen and unteachable, because almost no gradient flows back through its attention weights.

So the failure chain is direct: larger d → logit variance d → saturated softmax → near-zero gradients → attention stops learning. The damage scales with the head dimension, silently, and would make wider heads progressively harder to train.

[Figure, two panels. Softmax of s (unscaled): p ≈ one-hot, gradients ≈ 0. Softmax of s / √d: soft distribution, gradients flow.]

The same five scores, before and after scaling. Saturation turns a learnable weighting into a frozen argmax.
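
The saturation effect can be reproduced with a few lines of arithmetic. The following is a minimal sketch: the five scores are made-up illustrative values (not the ones from the figure), multiplied by √d to mimic typical unscaled logits at d = 64, and `max_jacobian_entry` is a hypothetical helper that evaluates the softmax derivative ∂p_i/∂s_j = p_i(δ_ij − p_j).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_jacobian_entry(p):
    """Largest |dp_i/ds_j|, using dp_i/ds_j = p_i * (delta_ij - p_j)."""
    n = len(p)
    return max(
        abs(p[i] * ((1.0 if i == j else 0.0) - p[j]))
        for i in range(n)
        for j in range(n)
    )

d = 64
unit_scores = [2.0, 0.5, -0.3, -1.1, 1.0]        # illustrative O(1) scores
raw = [x * math.sqrt(d) for x in unit_scores]    # typical unscaled magnitude
scaled = [x / math.sqrt(d) for x in raw]         # after dividing by sqrt(d)

p_raw, p_scaled = softmax(raw), softmax(scaled)

if __name__ == "__main__":
    print(f"unscaled: p_max={max(p_raw):.4f}  "
          f"max|dp/ds|={max_jacobian_entry(p_raw):.2e}")
    print(f"scaled:   p_max={max(p_scaled):.4f}  "
          f"max|dp/ds|={max_jacobian_entry(p_scaled):.2e}")
```

With these values the unscaled distribution is almost one-hot and its largest Jacobian entry is several hundred times smaller than in the scaled case.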

03 · The fix: Divide by the standard deviation

If Var[s] = d, then the standard deviation is √d, and dividing by it is just standardisation — the same move as normalising any random variable:

Var[s / √d] = d / (√d)² = 1
logit variance restored to 1, independent of d

Logits now live at O(1) regardless of head width, softmax operates in its responsive regime at initialisation, and the head dimension can be chosen for capacity reasons without touching training dynamics. That is the whole trick: √d is not tuned, it is derived.
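
For concreteness, here is where the factor sits in softmax(QKᵀ/√d)V. This is a deliberately small pure-Python sketch for a single head with no masking or batching; the function names are illustrative and real implementations use tensor libraries instead of nested lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])                  # dimension of each query/key vector
    scale = 1.0 / math.sqrt(d)     # the derived constant, not a tuned one
    out = []
    for q in Q:
        # Standardised logits: one scaled dot product per key.
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(logits)
        # Weighted average of the value rows.
        out.append([
            sum(w * v[c] for w, v in zip(weights, V))
            for c in range(len(V[0]))
        ])
    return out

if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[10.0, 0.0], [0.0, 10.0]]
    for row in scaled_dot_product_attention(Q, K, V):
        print([round(x, 3) for x in row])
```

Each output row is a convex combination of the rows of V, so the scale factor changes only how peaked the weights are, never the range of the output.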

Note the scope: the variance-1 argument holds at initialisation. Training can move logit scale afterwards (and entropy-collapse pathologies at long context motivate variants like QK-norm), but starting in the responsive regime is what lets training begin at all.

04 · Placement: The same instinct elsewhere

This is one instance of a recurring pattern: keep activations at unit scale so that gradients neither vanish nor explode. Xavier/He initialisation sets the weight variance roughly inversely proportional to fan-in for the same reason; layer norm enforces it dynamically. Attention needed its own copy of the trick because the dot product re-introduces a d-dependence that initialisation alone cannot see. The temperature knob in decoding is the same lever pulled deliberately, dividing logits to soften or sharpen a distribution, except that there it is a sampling choice and here it is a gradient-survival requirement.
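
The fan-in version of the same argument can be checked the same way. The sketch below assumes a single linear unit with unit-variance Gaussian inputs and no nonlinearity, and uses weight variance 1/fan_in (He initialisation for ReLU layers uses 2/fan_in instead); `preactivation_variance` is a hypothetical helper name.

```python
import math
import random

def preactivation_variance(fan_in, weight_std, trials=1000, seed=0):
    """Estimate Var[sum_i w_i x_i] with w_i ~ N(0, weight_std^2), x_i ~ N(0, 1)."""
    rng = random.Random(seed)
    ys = []
    for _ in range(trials):
        y = sum(rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0)
                for _ in range(fan_in))
        ys.append(y)
    mean = sum(ys) / trials
    return sum((y - mean) ** 2 for y in ys) / (trials - 1)

if __name__ == "__main__":
    for fan_in in (16, 128):
        naive = preactivation_variance(fan_in, 1.0)
        scaled = preactivation_variance(fan_in, 1.0 / math.sqrt(fan_in))
        print(f"fan_in={fan_in:4d}  unit-variance weights: {naive:7.1f}  "
              f"variance 1/fan_in: {scaled:5.2f}")
```

With unit-variance weights the pre-activation variance grows with fan-in; with variance 1/fan_in it stays near 1, which is the same standardisation that 1/√d performs on attention logits.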
