
The 1/√d Attention Scaling Factor

One constant, chosen so softmax keeps its gradients

01 · First principles: Why does d appear in the formula at all?

Attention scores are dot products: s = q·k, with q and k of dimension d. The formula in the original Transformer paper is softmax(QK^T/√d)V (there d is written d_k), and the √d looks like a fudge until you ask one statistical question: as d grows, what happens to the size of a typical dot product?

Assume the entries of q and k are roughly independent with mean 0 and variance 1 (which is what sensible initialisation and layer norm aim for). Then:

s = Σ_{i=1}^{d} q_i k_i ⇒ E[s] = 0, Var[s] = d

each term has variance 1; d independent terms add
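The per-term step can be spelled out as a short derivation. It is a sketch under the assumptions already stated: q_i and k_i are independent of each other and across i, each with mean 0 and variance 1.

```latex
\begin{aligned}
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}[q_i k_i] &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \bigl(\mathbb{E}[q_i k_i]\bigr)^2 = 1 \cdot 1 - 0 = 1, \\
\operatorname{Var}[s] &= \sum_{i=1}^{d} \operatorname{Var}[q_i k_i] = d .
\end{aligned}
```

The last line uses the fact that variances of uncorrelated terms add; if the entries were correlated, cross terms would appear and the result would no longer be exactly d.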

The variance is the sum of d unit variances. Typical logits are therefore of size ±√d — about ±8 at d = 64, ±11 at d = 128. The logits grow with a hyperparameter that has nothing to do with how confident attention ought to be.
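A quick Monte Carlo check makes the Var[s] = d claim concrete. This is a minimal sketch using only the Python standard library, and it assumes Gaussian entries (the argument above only needs mean 0 and variance 1). The function names `sample_dot_products` and `empirical_variance` are illustrative, not taken from any library.

```python
import math
import random
import statistics


def sample_dot_products(d, n_samples=4000, seed=0):
    """Draw n_samples dot products q.k where q and k have i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        # Each term is a product of two independent standard normals.
        s = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d))
        dots.append(s)
    return dots


def empirical_variance(d, n_samples=4000, seed=0):
    """Population variance of the sampled dot products (should be close to d)."""
    return statistics.pvariance(sample_dot_products(d, n_samples, seed))


if __name__ == "__main__":
    for d in (16, 64, 128):
        var = empirical_variance(d)
        print(f"d={d:4d}  Var[s]~{var:8.2f}  std~{math.sqrt(var):6.2f}  sqrt(d)={math.sqrt(d):6.2f}")
```

With a few thousand samples the estimated variance should land within a few percent of d, and the standard deviation close to √d.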

02 · The failure: Big logits kill softmax gradients

Softmax with large-magnitude inputs saturates: the largest logit takes probability ≈ 1, everything else ≈ 0. The gradient of softmax involves terms p_i(1 − p_i) and −p_i p_j, and every one of them vanishes when each p is pinned near 0 or 1. A saturated attention head is an argmax: frozen and unteachable, because almost no gradient flows back through the softmax to the queries and keys that produced the scores.

So the failure chain is direct: larger d → logit variance d → saturated softmax → near-zero gradients → attention stops learning. The damage scales with the head dimension, silently, and would make wide heads harder to train than narrow ones for reasons unrelated to the task.
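The vanishing-gradient step can be checked directly from the softmax Jacobian, whose entries are exactly the p_i(1 − p_i) and −p_i p_j terms named above. The sketch below uses five hypothetical scores (chosen for illustration, not taken from the figure) at unit scale and again multiplied by 8, which is √d for d = 64.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(p):
    """J[i][j] = d p_i / d s_j = p_i * (delta_ij - p_j)."""
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]


def max_abs_entry(matrix):
    """Largest absolute entry, used here as a crude measure of gradient size."""
    return max(abs(x) for row in matrix for x in row)


# Hypothetical scores: unit scale, and the same scores at sqrt(64) = 8 times that scale.
unit_scores = [2.0, 0.5, -0.5, 1.0, -1.5]
raw_scores = [8.0 * s for s in unit_scores]

for name, scores in (("unit-scale", unit_scores), ("unscaled d=64", raw_scores)):
    p = softmax(scores)
    jac = softmax_jacobian(p)
    print(f"{name:14s} max p = {max(p):.4f}  max |dp/ds| = {max_abs_entry(jac):.2e}")
```

For the unit-scale scores the largest Jacobian entry is of order 0.1; for the inflated scores it drops by several orders of magnitude, which is the saturation the text describes.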

The same five scores, before and after scaling. Saturation turns a learnable weighting into a frozen argmax.

03 · The fix: Divide by the standard deviation

If Var[s] = d, then the standard deviation is √d, and dividing by it is just standardisation — the same move as normalising any random variable:

Var[s / √d] = d / (√d)² = 1

logit variance restored to 1, independent of d

Logits now live at O(1) regardless of head width, softmax operates in its responsive regime at initialisation, and the head dimension can be chosen for capacity reasons without touching training dynamics. That is the whole trick: √d is not tuned, it is derived.
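To see the effect on the attention distribution itself, the sketch below computes attention weights for one random query against a handful of random keys, with and without the 1/√d factor, and reports the average largest weight. It assumes Gaussian q and k entries at initialisation; `attention_weights` and `mean_max_weight` are illustrative names, and the choice of 16 keys is arbitrary.

```python
import math
import random


def attention_weights(q, keys, scale=True):
    """Softmax over q.k for each key, optionally divided by sqrt(d)."""
    d = len(q)
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    if scale:
        logits = [s / math.sqrt(d) for s in logits]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_weight(d, n_keys=16, trials=200, scale=True, seed=0):
    """Average of the largest attention weight over random (q, keys) draws."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        keys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_keys)]
        acc += max(attention_weights(q, keys, scale))
    return acc / trials


if __name__ == "__main__":
    for d in (16, 64, 128):
        scaled = mean_max_weight(d, scale=True)
        unscaled = mean_max_weight(d, scale=False)
        print(f"d={d:4d}  mean max weight: scaled={scaled:.2f}  unscaled={unscaled:.2f}")
```

Under these assumptions the scaled column should stay roughly constant as d grows, while the unscaled column drifts towards 1, meaning a single key takes almost all of the weight.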

Note the scope: the variance-1 argument holds at initialisation. Training can move logit scale afterwards (logit growth and attention-entropy collapse motivate variants like QK-norm), but starting in the responsive regime is what lets training begin at all.

04 · Placement: The same instinct elsewhere

This is one instance of a recurring pattern: keep activations at unit scale so that gradients neither vanish nor explode. Xavier/He initialisation sets the weight variance roughly inversely proportional to fan-in for the same reason; layer norm enforces it dynamically. Attention needed its own copy of the trick because the dot product re-introduces a d-dependence that initialisation alone cannot see. The temperature knob in decoding is the same lever pulled deliberately, dividing logits to soften a distribution, except there it is a sampling choice, and here it is a gradient-survival requirement.
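The temperature comparison can be made concrete: dividing logits by √d is the same operation as a softmax at temperature T = √d. The sketch below uses hypothetical logits at the unscaled size for d = 64 and measures how spread out the distribution is via its entropy; `softmax_with_temperature` and `entropy` are illustrative helpers, not a library API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; larger temperature gives a softer distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; 0 for a one-hot distribution, log(n) for uniform."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# Hypothetical unscaled logits at d = 64, so sqrt(d) = 8.
logits = [16.0, 4.0, -4.0, 8.0, -12.0]

for t in (1.0, math.sqrt(64)):
    p = softmax_with_temperature(logits, t)
    print(f"T={t:4.1f}  max p = {max(p):.4f}  entropy = {entropy(p):.4f} nats")
```

At T = 1 the distribution is close to one-hot and its entropy is near zero; at T = √d it spreads across several entries, while still staying below the log(5) entropy of a uniform distribution.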

Mental Model

A dot product of two d-dimensional unit-variance vectors has variance d; typical logits grow like √d.
Softmax saturates on large logits, and saturated softmax has near-zero gradients — the head stops learning.
Dividing by √d is standardisation: it restores logit variance to 1 for any head width.
It is a derived constant, not a hyperparameter; the same "keep things unit-scale" instinct as Xavier init and layer norm.