
Cross Attention

My sequence asks questions of your sequence

01 · First principles: Attention never required one sequence

Write attention out and notice what it actually assumes:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$

Q: who is asking. K, V: who is being consulted.

Nothing in this formula says Q and K/V must come from the same place. Self-attention is the special case where one sequence plays both roles. Cross attention is the general case made explicit: queries are projected from sequence A, keys and values from sequence B. Each position of A asks "what in B is relevant to me?", and receives a B-flavoured summary weighted by relevance. A is the reader; B is the book.
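A minimal NumPy sketch of the formula, assuming a single head and randomly initialised projections. The names `cross_attention`, `W_q`, `W_k`, `W_v` and all sizes are illustrative, not taken from any library. The two sequences are given different lengths and different feature widths to show that only the projected dimension `d` has to match.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(A, B, W_q, W_k, W_v):
    """Queries from A, keys and values from B (single head, no mask)."""
    Q = A @ W_q                                # (N_A, d)
    K = B @ W_k                                # (N_B, d)
    V = B @ W_v                                # (N_B, d)
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))    # (N_A, N_B), rows sum to 1
    return weights @ V, weights                # (N_A, d), (N_A, N_B)

rng = np.random.default_rng(0)
d_A, d_B, d = 8, 6, 4                          # toy sizes (assumptions)
A = rng.normal(size=(3, d_A))                  # the reader: 3 positions
B = rng.normal(size=(5, d_B))                  # the book: 5 positions
W_q = rng.normal(size=(d_A, d))
W_k = rng.normal(size=(d_B, d))
W_v = rng.normal(size=(d_B, d))

out, weights = cross_attention(A, B, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Passing the same sequence in both roles, with projection shapes to match, recovers self-attention.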

The problem that forces this to exist is conditioning. Whenever generation must be steered by something else (a source sentence, a text prompt for an image, an audio clip), that something arrives as its own sequence, in its own representation space, with its own length. You need a differentiable bridge that lets every generated position consult every condition position. Cross attention is that bridge, and the standard one whenever the condition is kept as a separate sequence.

02 · The contrast: Self vs cross, term by term

| | Self-attention | Cross attention |
|---|---|---|
| Q projected from | sequence X | target / generated stream A |
| K, V projected from | the same X | conditioning stream B |
| Score matrix shape | N × N | N_A × N_B (rectangular) |
| Answers the question | "how do my own parts relate?" | "what over there matters to me here?" |
| Causal mask | yes, in decoders (see causal attention) | none: B is fully known; the future of A is the only secret, and B is not A's future |
| Information flow | within a stream | strictly B → A (A cannot alter B's representations) |
| KV cache at inference | grows with each generated token | computed once from B, reused for every decoding step |

The last row is an easily missed efficiency win: the conditioning sequence does not change during generation, so its K and V are computed exactly once, however many tokens are decoded against them.
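A rough sketch of that reuse, under the same single-head assumptions. The decoder's hidden state at each step is replaced by a random stand-in, and the counter `kv_projections` exists only to make the point visible; both are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))

B = rng.normal(size=(5, d_model))      # conditioning sequence, fixed
kv_projections = 0                     # how many times B gets projected

# Project B exactly once, before decoding starts.
K_B, V_B = B @ W_k, B @ W_v
kv_projections += 1

outputs = []
for step in range(10):
    # Stand-in for the decoder's hidden state at this step (hypothetical).
    a_t = rng.normal(size=(1, d_model))
    q_t = a_t @ W_q                            # (1, d): only the query is new
    w_t = softmax(q_t @ K_B.T / np.sqrt(d))    # (1, N_B)
    outputs.append(w_t @ V_B)                  # (1, d)

outputs = np.concatenate(outputs)              # (10, d)
print(kv_projections, outputs.shape)
```

The self-attention cache in the same decoder would instead gain one key and one value row per step.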

03 · The geometry: One rectangular score matrix

[Figure: the query from the target token "cat" (sequence A, t=2) attends over the keys and values of sequence B, the condition "Le chat est noir"; the largest weight, 0.86, falls on "chat".]

Translation, mid-generation. The English token being produced sends its query across; the French keys answer; "chat" wins the softmax and its value dominates the summary.

Stack this per pair of positions and the score matrix is N_A × N_B, generally rectangular, which is the visible signature that two different sequences are involved. Soft alignment between source and target words, learned with no alignment supervision, was the original 2014 motivation (Bahdanau's attention predates the transformer; cross attention is its direct descendant).
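To make the rectangular matrix concrete, here is a toy version of the translation example. The vectors are hand-picked for illustration (one-hot keys, and queries that point at the matching source word); a trained model would learn them, and its weights would not be this clean.

```python
import numpy as np

source = ["Le", "chat", "est", "noir"]       # sequence B
target = ["The", "cat"]                      # sequence A, generated so far

# Toy, hand-picked vectors (assumption): one key per source word, and
# queries aimed at the source word each target word translates.
K = np.eye(4)                                # (N_B, d) with d = 4
Q = 6.0 * np.array([[1.0, 0.0, 0.0, 0.0],    # "The" looks for "Le"
                    [0.0, 1.0, 0.0, 0.0]])   # "cat" looks for "chat"

scores = Q @ K.T / np.sqrt(K.shape[1])       # (N_A, N_B) = (2, 4)
scores = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Read off the soft alignment: the source word with the largest weight.
alignment = [source[j] for j in weights.argmax(axis=1)]
print(weights.shape, alignment)
```

Each row is one target word's distribution over the source, so reading the argmax row by row gives the alignment.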

04 · Where it lives: Two canonical homes

Encoder–decoder translation (the original transformer). Each decoder block runs causal self-attention over the target produced so far, then cross attention with Q from the target and K, V from the encoder's output, then the MLP. The decoder phrases what it needs; the frozen-for-this-step encoder supplies it. Every generated word gets a fresh, position-specific look at the entire source — not a single squashed summary vector, which was precisely the RNN bottleneck attention was invented to break.
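The three sub-layers can be sketched as follows. This is a simplified illustration, not the original implementation: it uses one head, keeps the residual connections, and omits layer normalisation and dropout. The names `decoder_block`, `memory`, and `params` are assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x_q, x_kv, W_q, W_k, W_v, causal=False):
    Q, K, V = x_q @ W_q, x_kv @ W_k, x_kv @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Hide later positions: mask the strictly upper triangle.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V

def decoder_block(target, memory, p):
    # 1. Causal self-attention over the target produced so far.
    h = target + attention(target, target, *p["self"], causal=True)
    # 2. Cross attention: Q from the target, K and V from the encoder output.
    h = h + attention(h, memory, *p["cross"])
    # 3. Position-wise MLP with a ReLU.
    W1, W2 = p["mlp"]
    return h + np.maximum(h @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 8

def mats(n, shape):
    return [rng.normal(scale=0.1, size=shape) for _ in range(n)]

params = {
    "self": mats(3, (d, d)),
    "cross": mats(3, (d, d)),
    "mlp": mats(1, (d, 4 * d)) + mats(1, (4 * d, d)),
}
memory = rng.normal(size=(6, d))     # encoder output for a 6-token source
target = rng.normal(size=(4, d))     # 4 target tokens so far
out = decoder_block(target, memory, params)
print(out.shape)
```

Only the self-attention call is masked; the cross-attention call sees all of `memory` at every target position.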

Text-conditioned diffusion (Stable Diffusion). The U-Net denoises a latent image; at several resolutions, spatial positions of its feature maps emit queries and the text encoder's states for the prompt supply keys and values. Each region of the image independently asks the prompt what it should contain. The rectangular attention maps are interpretable enough that prompt-editing methods manipulate them directly to move or restyle objects. Multimodal LLMs (Flamingo-style) use the same pattern with vision features as B: whenever two modalities meet, this is the joint.
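A toy sketch of where those attention maps come from, assuming a single head, random features, and made-up sizes (a 4×4 feature map and a 3-token prompt). It is not Stable Diffusion's actual layer, which is multi-head and sits inside a larger block; it only shows the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, T, d = 4, 4, 3, 8                        # toy sizes (assumptions)

image_feats = rng.normal(size=(H * W, d))      # flattened spatial positions (A)
text_states = rng.normal(size=(T, d))          # prompt encoder states (B)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q = image_feats @ W_q                          # (H*W, d)
K, V = text_states @ W_k, text_states @ W_v    # (T, d) each

scores = Q @ K.T / np.sqrt(d)                  # (H*W, T): rectangular
scores -= scores.max(axis=1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

conditioned = weights @ V                      # (H*W, d): prompt info per position
# One spatial map per prompt token: how strongly each position attends to it.
token_maps = weights.T.reshape(T, H, W)
print(conditioned.shape, token_maps.shape)
```

Reshaping the columns of the weight matrix back onto the spatial grid gives one map per prompt token, which is the object that attention-based editing methods inspect or modify.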

Design reading: cross attention is the architecture's expression of a one-way dependency. Put B in K/V and B shapes A without being shaped by it — the condition stays fixed, the generation bends around it. When you see a cross-attention block, read it as an arrow.