
Transformer-XL

Recurrence between segments, so context survives the window edge

01 · First principles · Why does a window exist at all?

Attention costs O(N²), so a vanilla transformer trains on text chopped into fixed segments of, say, 512 tokens. Each segment is processed in isolation. The question that forces Transformer-XL to exist: what happens to information at the boundary between segments?

It is destroyed. Token 513 begins a new segment with no memory that tokens 1–512 ever existed, even if token 512 was the subject of its sentence. The literature calls this context fragmentation: the model's effective memory is not "the last 512 tokens" but "the tokens since the last arbitrary chop point," which near a boundary is almost nothing.

Two distinct failures follow. Training: the early tokens of every segment are predicted from little or no usable context, which wastes capacity and hides every dependency that crosses a boundary. Inference: to give every position a full window you must re-encode an entire 512-token segment after each single-token slide, which is a full O(N²) attention pass per generated token and intolerably slow.
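To make the fragmentation concrete, here is a minimal Python sketch. The function name `visible_context` and the 0-based indexing are illustrative assumptions, not part of the original; the function only counts how many earlier tokens a position can attend to when the text is chopped into fixed, isolated segments.

```python
def visible_context(token_index: int, segment_len: int = 512) -> int:
    """Number of earlier tokens a position can see under fixed segmentation.

    token_index is 0-based, so index 512 corresponds to "token 513" in the text.
    """
    # A token only sees earlier tokens that fall inside its own segment.
    return token_index % segment_len


if __name__ == "__main__":
    for idx in (511, 512, 513, 1023):
        print(f"token index {idx}: {visible_context(idx)} earlier tokens visible")
```

The last token of a segment sees 511 predecessors, while the very next token sees none, which is the cliff described above.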

02 · The naive fix · Just make the window bigger?

Doubling the segment length roughly quadruples the attention cost, since O(N²) becomes O((2N)²), and merely moves the cliff; the boundary failure is structural, not a matter of size. What we actually want is what RNNs had all along, state that carries over, without giving up parallel training within a segment. So the fix is to graft recurrence onto the transformer, but at the level of segments rather than tokens.

03 · The mechanism · Segment-level recurrence with a frozen cache

While processing segment τ, every layer is allowed to attend not only to this segment's hidden states but also to the cached hidden states of the previous segment, held fixed:

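$$\tilde{h}^{\,n-1}_{\tau} = \left[\, \mathrm{SG}\!\left(h^{n-1}_{\tau-1}\right) \circ h^{n-1}_{\tau} \,\right]$$

K and V are computed from $\tilde{h}^{\,n-1}_{\tau}$; Q is computed from $h^{n-1}_{\tau}$ only. Here $\mathrm{SG}(h^{n-1}_{\tau-1})$ is the cached memory (stop-gradient), $h^{n-1}_{\tau}$ is the current segment, and $\circ$ denotes concatenation along the sequence dimension.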

SG is stop-gradient: the cache is read-only. Queries come only from the current segment (we do not recompute the past); keys and values span both. The stop-gradient is what keeps training affordable — backprop never unrolls into previous segments, so the cost per segment is unchanged, yet information still flows forward through the cache.
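The rule above can be sketched for a single attention head in NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the name `attend_with_memory`, the weight matrices `w_q`, `w_k`, `w_v`, and the shapes are all hypothetical, and positional information is left out entirely. NumPy has no autodiff, so the stop-gradient appears only as a comment; in a framework such as PyTorch it would correspond to detaching the cache.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax that tolerates -inf entries from masking."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend_with_memory(h_cur, h_mem, w_q, w_k, w_v):
    """One attention head over [cached previous segment ; current segment].

    h_cur: (cur_len, d) hidden states of the current segment, layer n-1
    h_mem: (mem_len, d) cached hidden states of the previous segment, layer n-1
    """
    # With autodiff, the cache would be wrapped in a stop-gradient here
    # (e.g. h_mem.detach() in PyTorch). In NumPy it is simply read-only input.
    h_ext = np.concatenate([h_mem, h_cur], axis=0)  # (mem_len + cur_len, d)

    q = h_cur @ w_q  # queries come from the current segment only
    k = h_ext @ w_k  # keys span cache + current segment
    v = h_ext @ w_v  # values span cache + current segment

    mem_len, cur_len = h_mem.shape[0], h_cur.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (cur_len, mem_len + cur_len)

    # Causal mask: query i sits at column mem_len + i, so it may see the whole
    # cache plus current positions up to and including itself.
    cols = np.arange(mem_len + cur_len)[None, :]
    rows = np.arange(cur_len)[:, None] + mem_len
    scores = np.where(cols <= rows, scores, -np.inf)

    return softmax(scores) @ v  # (cur_len, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    h_mem = rng.normal(size=(4, d))
    h_cur = rng.normal(size=(4, d))
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    print(attend_with_memory(h_cur, h_mem, w_q, w_k, w_v).shape)
```

The output has one row per current-segment token, so the per-segment output size is the same as without the cache; only the key and value side grows.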

Each layer reaches one segment further back than the layer below, so an L-layer model sees O(L × segment) tokens — context compounds with depth.
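The compounding can be checked with a small dependency-tracking sketch in plain Python. The function `earliest_visible_token` is a hypothetical helper, and it assumes the cache holds exactly one previous segment at every layer; it tracks, for each position, the oldest input token that can influence it after a given number of layers.

```python
from typing import List


def earliest_visible_token(num_layers: int, seg_len: int, num_tokens: int) -> List[int]:
    """For each position, the index of the oldest input token that can reach it."""
    # Layer 0 (embeddings): every position depends only on itself.
    earliest = list(range(num_tokens))
    for _ in range(num_layers):
        nxt = []
        for t in range(num_tokens):
            seg_start = (t // seg_len) * seg_len
            # Attention window: the cached previous segment plus the current
            # segment up to and including position t.
            window_start = max(0, seg_start - seg_len)
            nxt.append(min(earliest[window_start:t + 1]))
        earliest = nxt
    return earliest


if __name__ == "__main__":
    for layers in (1, 2, 3):
        reach = earliest_visible_token(layers, seg_len=4, num_tokens=20)
        print(f"{layers} layer(s): token 19 can depend on tokens back to index {reach[19]}")
```

With segments of 4 tokens, the last token of the fifth segment reaches back to index 12 after one layer, 8 after two, and 4 after three: one extra segment per layer.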

At inference the same cache makes generation fast: slide forward by reusing cached states instead of re-encoding the window, which made evaluation up to ~1800× faster than the recompute-everything baseline.
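A rough way to see where the saving comes from is to count how many token positions are pushed through the network while generating. The sketch below is an assumption-laden tally, not a benchmark: `positions_encoded` is a hypothetical helper, it counts positions rather than wall-clock time, and it makes no attempt to reproduce the reported speedup figure.

```python
def positions_encoded(num_new_tokens: int, window: int, use_cache: bool) -> int:
    """Count token positions encoded while generating num_new_tokens tokens.

    Without a cache, every one-token slide re-encodes the full window.
    With a cache, only the newly arrived token is encoded and older
    hidden states are reused.
    """
    per_step = 1 if use_cache else window
    return num_new_tokens * per_step


if __name__ == "__main__":
    baseline = positions_encoded(100, window=512, use_cache=False)
    cached = positions_encoded(100, window=512, use_cache=True)
    print(baseline, cached, baseline // cached)
```

Under this simple count the gap scales with the window length, which is why the advantage grows as the attended context gets longer.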

04 · The bug this creates · Absolute positions lie across segments

Recurrence breaks the original positional scheme, and it is worth seeing precisely how. With absolute encodings, every segment stamps its tokens with positions 0…511. Now the cached segment and the current segment both carry the same stamps:

pos(token in cache) = 17 and pos(token now) = 17 ⇒ indistinguishable to attention

The model cannot tell "the 17th token of the previous segment" from "the 17th token of this one" (a catastrophe for any order-sensitive pattern). The information that actually matters to attention was never "where am I on an absolute ruler" but "how far apart are query and key."
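The collision is easy to reproduce. The following Python sketch uses hypothetical helper names (`absolute_stamps`, `relative_offsets`) and assumes the keys are laid out as the cached segment followed by the current segment; it compares the position labels attention would see under each scheme.

```python
from typing import List


def absolute_stamps(mem_len: int, cur_len: int) -> List[int]:
    """Per-segment absolute positions for the keys [cache ; current]."""
    # Each segment restarts its numbering at 0, so the labels repeat.
    return list(range(mem_len)) + list(range(cur_len))


def relative_offsets(query_pos: int, mem_len: int, cur_len: int) -> List[int]:
    """Offset i - j from one current-segment query to every key in [cache ; current]."""
    q_global = mem_len + query_pos  # the query's index in the concatenated sequence
    # Negative offsets are future keys, which a causal mask would hide.
    return [q_global - j for j in range(mem_len + cur_len)]


if __name__ == "__main__":
    stamps = absolute_stamps(512, 512)
    print(stamps[17], stamps[512 + 17])  # same label for two different tokens

    offsets = relative_offsets(query_pos=100, mem_len=512, cur_len=512)
    print(offsets[17], offsets[512 + 17])  # two different distances
```

Under absolute stamping both tokens are labeled 17; under relative offsets the same two keys are 595 and 83 steps behind the query, so they can no longer be confused.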

05 · The repair · Relative positional encodings

So Transformer-XL rewrites the attention score to depend only on the offset i − j. Expanding the standard score with absolute embeddings gives four terms (content–content, content–position, position–content, position–position); the fix replaces every key-side absolute position vector with a sinusoidal encoding of the relative distance R_{i−j}, and replaces the query-side position terms with two learned global bias vectors u, v that stand in for "the query's own position," which no longer needs to exist:

$$A_{ij} = q_i^{\top} k_j + q_i^{\top} W_R R_{i-j} + u^{\top} k_j + v^{\top} W_R R_{i-j}$$

Here $R_{i-j}$ encodes the relative distance, and $u$ and $v$ are the learned global biases.

A distance of 600 tokens now means the same thing whether or not a segment boundary sits in between. Recurrence and relative positions are not two separate features; the second is what makes the first coherent. The broader family of relative schemes — Shaw, T5 buckets, ALiBi — is covered in Relative Positional Embeddings, and the modern descendant of this idea is RoPE.
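The four-term score can be sketched directly in NumPy. This is an illustrative, deliberately naive version under stated assumptions: the names `sinusoid` and `relative_scores`, the shapes, and the use of a single head are hypothetical, the model dimension is assumed even, and every offset is materialized explicitly for clarity rather than efficiency.

```python
import numpy as np


def sinusoid(offsets: np.ndarray, d: int) -> np.ndarray:
    """Sinusoidal encoding of relative distances, shape (len(offsets), d); d must be even."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = np.outer(offsets, inv_freq)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


def relative_scores(q, k, w_r, u, v):
    """A[i, j] = q_i.k_j + q_i.(W_R R_{i-j}) + u.k_j + v.(W_R R_{i-j}).

    q: (cur_len, d) queries from the current segment
    k: (mem_len + cur_len, d) content keys over cache + current segment
    w_r: (d, d) projection applied to the relative encoding
    u, v: (d,) learned global bias vectors
    """
    cur_len, d = q.shape
    key_len = k.shape[0]
    mem_len = key_len - cur_len

    i = np.arange(cur_len)[:, None] + mem_len  # query index in [cache ; current]
    j = np.arange(key_len)[None, :]
    offsets = i - j  # (cur_len, key_len); negative entries are future keys

    # W_R R_{i-j} for every (query, key) pair: (cur_len, key_len, d)
    r = sinusoid(offsets.ravel(), d).reshape(cur_len, key_len, d) @ w_r.T

    content_content = q @ k.T                         # q_i . k_j
    content_position = np.einsum("id,ijd->ij", q, r)  # q_i . W_R R_{i-j}
    global_content = (k @ u)[None, :]                 # u . k_j
    global_position = r @ v                           # v . W_R R_{i-j}
    return content_content + content_position + global_content + global_position


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, mem_len, cur_len = 8, 4, 4
    q = rng.normal(size=(cur_len, d))
    k = rng.normal(size=(mem_len + cur_len, d))
    w_r = rng.normal(size=(d, d))
    u, v = rng.normal(size=d), rng.normal(size=d)
    print(relative_scores(q, k, w_r, u, v).shape)
```

Nothing in the function knows where the segment boundary falls except through `mem_len`, which is only used to compute offsets; the two position terms therefore take the same value for any query and key pair that share an offset.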

06 · The ledger · Costs and legacy

Memory: the cache stores hidden states for every layer — essentially a KV cache before the name existed; that is the price of the long context.
Truncated gradients: stop-gradient means long-range credit assignment is approximate; the model uses distant context but is never trained through it.
Legacy: the cached-states idea became the inference KV cache used by every modern LLM, and it made relative position the default assumption of the field.

Mental Model

Fixed segments do not shorten memory; they shatter it at arbitrary chop points.
Fix: let each segment attend to the previous segment's cached hidden states, with stop-gradient so training cost is unchanged.
Context compounds with depth: layer n reaches n segments back.
Absolute positions collide across segments; only relative offsets stay meaningful, so the positional scheme must follow the recurrence.
Read it as the origin story of the KV cache plus the case for relative position.