One constant, chosen so softmax keeps its gradients
Attention scores are dot products: s = q·k, with q and k of dimension d. The formula in the paper is softmax(QKT/√d)V, and the √d looks like a fudge until you ask one statistical question — as d grows, what happens to the size of a typical dot product?
Assume the entries of q and k are roughly independent with mean 0 and variance 1 (which is what sensible initialisation and layer norm aim for). Then:
The variance is the sum of d unit variances. Typical logits are therefore of size ±√d — about ±8 at d = 64, ±11 at d = 128. The logits grow with a hyperparameter that has nothing to do with how confident attention ought to be.
Softmax with large-magnitude inputs saturates: the largest logit takes probability ≈ 1, everything else ≈ 0. The gradient of softmax involves terms pi(1 − pi) and −pipj, and every one of them vanishes when each p is pinned near 0 or 1. A saturated attention head is an argmax — frozen, and unteachable, because no gradient flows back through its weights.
So the failure chain is exact: larger d → logit variance d → saturated softmax → near-zero gradients → attention stops learning. The damage scales with the head dimension, silently, and would make wide heads strictly worse to train.
The same five scores, before and after scaling. Saturation turns a learnable weighting into a frozen argmax.
If Var[s] = d, then the standard deviation is √d, and dividing by it is just standardisation — the same move as normalising any random variable:
Logits now live at O(1) regardless of head width, softmax operates in its responsive regime at initialisation, and the head dimension can be chosen for capacity reasons without touching training dynamics. That is the whole trick: √d is not tuned, it is derived.
This is one instance of a recurring pattern: keep activations at unit scale so that gradients neither vanish nor explode. Xavier/He initialisation divides by fan-in for the same reason; layer norm enforces it dynamically. Attention needed its own copy of the trick because the dot product re-introduces a d-dependence that initialisation alone cannot see. The temperature knob in decoding is the same lever pulled deliberately — dividing logits to soften a distribution — except there it is a sampling choice, and here it is a gradient-survival requirement.