A multi-frequency clock that turns position into geometry
01 · First principles
Attention cannot see order
Attention is a function of sets. Permute the input tokens and the outputs permute identically: nothing in QKᵀV depends on where a token sits, only on what it is. "Dog bites man" and "man bites dog" produce the same bag of attention outputs. This is called permutation equivariance, and for language it is a defect. Order carries meaning, so order must be injected, because the architecture will never recover it on its own.
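The equivariance claim can be checked numerically. The following is a minimal sketch, assuming a single unmasked attention head with scaled dot-product scores; the helper names (`matmul`, `softmax`, `self_attention`, `rand_matrix`) and the random weights are illustrative, not taken from any library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Unmasked single-head attention, softmax(Q K^T / sqrt(d)) V, with no position signal."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix, used only as stand-in data and weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d = 5, 4
x = rand_matrix(rng, n, d)  # five token embeddings
wq, wk, wv = (rand_matrix(rng, d, d) for _ in range(3))

perm = [3, 0, 4, 1, 2]  # an arbitrary reordering of the tokens
out = self_attention(x, wq, wk, wv)
out_perm = self_attention([x[i] for i in perm], wq, wk, wv)

# Row i of the permuted output should equal row perm[i] of the original output.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(max_diff)  # floating-point noise only
```

Each token receives the same output vector regardless of where it sits in the sequence, which is why the head cannot distinguish one ordering from another.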
The cheapest injection is to add a position-dependent vector pₜ to each token embedding. The whole design question is: what should pₜ be?
02 · Naive attempts
Two obvious encodings, two failures
pₜ = t (a counter)
Unbounded: position 5000 produces a vector component 5000× larger than position 1, drowning the token embedding it is added to. Scale depends on sequence length; nothing about it is stable.
pₜ = t / N (normalised counter)
Bounded, but the meaning of a step now depends on N: positions 3 and 4 are 0.1 apart in a 10-token sequence and 0.001 apart in a 1000-token one. "Next word" should mean the same thing everywhere.
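The arithmetic behind both failures is short enough to run. This is a small sketch using scalar encodings; the function names `counter` and `normalised_counter` are made up for illustration.

```python
def counter(t):
    """Raw counter: the encoding is the position index itself."""
    return float(t)

def normalised_counter(t, n):
    """Counter divided by the sequence length n, so values stay in [0, 1]."""
    return t / n

# Failure 1: magnitude grows without bound as the sequence gets longer.
magnitude_ratio = counter(5000) / counter(1)

# Failure 2: the size of one step depends on the sequence length.
step_in_short = normalised_counter(4, 10) - normalised_counter(3, 10)
step_in_long = normalised_counter(4, 1000) - normalised_counter(3, 1000)

print(magnitude_ratio, step_in_short, step_in_long)
```

The same "next word" step is about one hundred times smaller in the 1000-token sequence than in the 10-token one.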
What we actually want is a code where every position is distinct, every component is bounded, and — crucially — the relationship between positions t and t+k looks the same for every t. Adjacency should be a fixed geometric fact.
03 · The fix
Sinusoids at geometrically spaced frequencies
The Transformer paper's encoding fills the d-dimensional position vector with sine/cosine pairs, each pair ticking at its own frequency:
pₜ[2i] = sin(ωᵢ t), pₜ[2i+1] = cos(ωᵢ t), with ωᵢ = 1 / 10000^(2i/d) for i = 0, …, d/2 − 1
frequencies from 1 down to roughly 1/10000, geometrically spaced
The right analogy is binary counting. Write integers in binary and the lowest bit flips every step, the next every two steps, the next every four — fast and slow dials together identify the number uniquely. Sinusoids are the smooth version of the same idea: dimension pair 0 spins quickly (distinguishing neighbours), the last pair drifts over thousands of positions (encoding coarse location), and the full vector is a clock with d/2 hands. Bounded in [−1, 1], distinct for every position, no length normalisation anywhere.
Three of the d/2 frequency channels. Reading all hands at once identifies t uniquely, like reading a clock's second, minute and hour hands.
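The clock can be written out in a few lines. This is a minimal sketch, assuming the interleaved layout of the Transformer paper (sine in even dimensions, cosine in odd ones) and frequencies ωᵢ = 10000^(−2i/d); the function name `sinusoidal_encoding` and the small sizes below are illustrative choices.

```python
import math

def sinusoidal_encoding(t, d, base=10000.0):
    """Return the d-dimensional sinusoidal encoding of position t (d must be even)."""
    if d % 2 != 0:
        raise ValueError("d must be even so that dimensions pair up as (sin, cos)")
    vec = []
    for i in range(d // 2):
        omega = base ** (-2 * i / d)  # pair 0 is fastest, the last pair slowest
        vec.extend([math.sin(omega * t), math.cos(omega * t)])
    return vec

# Encode the first 50 positions in 8 dimensions (4 clock hands).
table = [sinusoidal_encoding(t, 8) for t in range(50)]

# Every component is bounded, and no two positions share a vector.
is_bounded = all(-1.0 <= v <= 1.0 for row in table for v in row)
is_distinct = len({tuple(round(v, 9) for v in row) for row in table}) == len(table)
print(is_bounded, is_distinct)
```

Nothing in the function refers to the sequence length, which is the point: the encoding of position t is the same in a 10-token and a 1000-token input.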
04 · The property that matters
An offset is a linear map
Here is the reason this encoding was chosen over any other bounded injective code. The angle-addition identities say:
sin(ω(t+k)) = sin(ωt) cos(ωk) + cos(ωt) sin(ωk)
cos(ω(t+k)) = cos(ωt) cos(ωk) − sin(ωt) sin(ωk)
In matrix form: pₜ₊ₖ = Rₖ pₜ, where Rₖ is a block-diagonal rotation (one 2×2 block per frequency) whose angles depend only on the offset k. "Shift by k" is one fixed linear transformation, the same everywhere in the sequence. Because an attention head's query and key projections are linear, a head can learn to ask "what was 3 tokens back?" as a single matrix: relative position becomes learnable by linear machinery, which is exactly what the model has. This rotation property is the seed that RoPE later promotes from "available if learned" to "enforced by construction".
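The rotation property is easy to verify numerically. The sketch below assumes the interleaved (sin, cos) layout with frequencies 10000^(−2i/d); `encode_position` and `shift` are hypothetical helper names, and `shift` applies the block-diagonal rotation one 2×2 block at a time instead of building the full matrix.

```python
import math

def encode_position(t, d, base=10000.0):
    """Sinusoidal encoding of position t as interleaved (sin, cos) pairs."""
    vec = []
    for i in range(d // 2):
        omega = base ** (-2 * i / d)
        vec.extend([math.sin(omega * t), math.cos(omega * t)])
    return vec

def shift(p, k, base=10000.0):
    """Apply R_k: rotate each (sin, cos) pair by the angle omega_i * k.

    The rotation uses only the offset k, never the position encoded in p.
    """
    d = len(p)
    out = []
    for i in range(d // 2):
        omega = base ** (-2 * i / d)
        c, s = math.cos(omega * k), math.sin(omega * k)
        sin_t, cos_t = p[2 * i], p[2 * i + 1]
        out.extend([
            sin_t * c + cos_t * s,  # sin(omega * (t + k))
            cos_t * c - sin_t * s,  # cos(omega * (t + k))
        ])
    return out

# The same offset k = 3 is applied at three unrelated positions.
d, k = 16, 3
errors = []
for t in (0, 7, 300):
    rotated = shift(encode_position(t, d), k)
    direct = encode_position(t + k, d)
    errors.append(max(abs(a - b) for a, b in zip(rotated, direct)))
print(errors)  # floating-point noise only
```

The rotation for k = 3 is identical at t = 0, t = 7 and t = 300, which is what lets a single learned matrix stand for "three tokens back" anywhere in the sequence.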
05 · Limits
What it does not buy
Additive mixing is blunt. Position is summed into the same vector as content, so every downstream layer must disentangle "what" from "where" on its own. Methods that keep position out of the residual stream (relative biases, RoPE) tend to train better.
Extrapolation is theoretical, not practical. The formula is defined for any t, but models attend poorly at positions beyond anything seen in training; the slow channels enter ranges the network never learned to read.
Learned absolute embeddings (GPT-2 style) match it in quality at trained lengths and fail harder beyond them — which tells you the sinusoid's elegance was never the binding constraint at short context.
Status today: rarely used in new LLMs, but the canonical baseline, and the cleanest place to learn why all positional schemes are really about making relative offsets easy to compute.
Mental Model
Attention is permutation-equivariant; word order does not exist until you add it.
Sinusoidal encoding is a smooth binary counter: d/2 clock hands at geometrically spaced frequencies, bounded and unique.
The load-bearing property: pₜ₊ₖ = Rₖ pₜ. A fixed offset is one fixed rotation, so relative position is linearly accessible.
Additive position mixing pollutes the residual stream; later schemes move position into the attention computation instead.