LLMs

Cross Attention

My sequence asks questions of your sequence

01 · First principles: Attention never required one sequence

Write attention out and notice what it actually assumes:

Attn(Q, K, V) = softmax(QK^T/√d) V

Q: who is asking · K, V: who is being consulted

Nothing in this formula says Q and K/V must come from the same place. Self-attention is the special case where one sequence plays both roles. Cross attention is the general case made explicit: queries are projected from sequence A, keys and values from sequence B. Each position of A asks "what in B is relevant to me?", and receives a B-flavoured summary weighted by relevance. A is the reader; B is the book.
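The formula can be sketched in a few lines of NumPy to make the two roles concrete. This is a minimal single-head illustration, not a production implementation: the projection matrices `w_q`, `w_k`, `w_v`, the sizes, and the random inputs are all assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(a, b, w_q, w_k, w_v):
    """a: (N_A, d_a) asking sequence; b: (N_B, d_b) consulted sequence."""
    q = a @ w_q                                  # (N_A, d): queries come from A
    k = b @ w_k                                  # (N_B, d): keys come from B
    v = b @ w_v                                  # (N_B, d): values come from B
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (N_A, N_B)
    weights = softmax(scores, axis=-1)           # each row sums to 1 over B
    return weights @ v, weights

rng = np.random.default_rng(0)
n_a, n_b, d_a, d_b, d = 3, 5, 8, 6, 4            # illustrative sizes (assumed)
a = rng.normal(size=(n_a, d_a))
b = rng.normal(size=(n_b, d_b))
w_q = rng.normal(size=(d_a, d))
w_k = rng.normal(size=(d_b, d))
w_v = rng.normal(size=(d_b, d))

out, weights = cross_attention(a, b, w_q, w_k, w_v)
print(out.shape, weights.shape)                  # (3, 4) (3, 5)

# Self-attention is the same call with one sequence playing both roles.
w_k_self = rng.normal(size=(d_a, d))
w_v_self = rng.normal(size=(d_a, d))
self_out, self_weights = cross_attention(a, a, w_q, w_k_self, w_v_self)
print(self_weights.shape)                        # (3, 3)
```

Note that A and B may differ in both length and feature width; only the projected dimension `d` has to agree so that queries and keys can be compared.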

The problem that forces this to exist is conditioning. Whenever generation must be steered by something else (a source sentence, a text prompt for an image, an audio clip), that something arrives as its own sequence, in its own representation space, with its own length. You need a differentiable bridge that lets every generated position consult every condition position. Cross attention is that bridge, and the most common one when the condition is kept as a separate sequence (the main alternative is to concatenate the condition into the input and rely on self-attention).

02 · The contrast: Self vs cross, term by term

	Self-attention	Cross attention
Q projected from	sequence X	target / generated stream A
K, V projected from	the same X	conditioning stream B
Score matrix shape	N × N	N_A × N_B (rectangular)
Answers the question	"how do my own parts relate?"	"what over there matters to me here?"
Causal mask	yes, in decoders (see causal attention)	none — B is fully known; the future of A is the only secret, and B is not A's future
Information flow	within a stream	strictly B → A (A cannot alter B's representations)
KV cache at inference	grows with each generated token	computed once from B, reused for every decoding step

The last row is an easily missed efficiency win: the conditioning sequence does not change during generation, so its K and V are computed exactly once, however many tokens are decoded against them.
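A rough sketch of that reuse during decoding, under simplifying assumptions: a single head, random weights, and a random vector standing in for the decoder state at each step (a real decoder would produce it from its own layers). The names `k_cross`, `v_cross`, and `self_cache` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_b, steps = 4, 6, 5                      # illustrative sizes (assumed)
b = rng.normal(size=(n_b, d))                # conditioning sequence, fixed during generation
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Cross-attention cache: projected once, before the first decoding step.
k_cross = b @ w_k                            # (N_B, d)
v_cross = b @ w_v                            # (N_B, d)

self_cache = []                              # stand-in for the self-attention cache
outputs = []
for step in range(steps):
    x_t = rng.normal(size=d)                 # hypothetical decoder state for the new token
    self_cache.append(x_t)                   # grows by one entry per generated token
    q = x_t @ w_q
    weights = softmax(q @ k_cross.T / np.sqrt(d))   # (N_B,)
    outputs.append(weights @ v_cross)        # same K, V reused at every step

print(len(self_cache), k_cross.shape)        # 5 (6, 4)
```

The self-attention cache ends with one entry per decoded token, while the cross-attention K and V keep the shape they had before decoding started.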

03 · The geometry: One rectangular score matrix

Translation, mid-generation. The English token being produced sends its query across; the French keys answer; "chat" wins the softmax and its value dominates the summary.

Compute one such score for every (target, source) pair of positions and the score matrix is N_A × N_B: generally rectangular, which is the visible signature that two different sequences are involved. Soft alignment between source and target words, learned with no alignment supervision, was the original 2014 motivation (Bahdanau's attention predates the transformer; cross attention is its direct descendant).
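A toy version of the translation picture, to show the rectangular matrix and the soft alignment it encodes. The vectors here are built by hand purely for illustration (one-hot keys, queries pointed at the word they should align with); in a trained model they would come from learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

source = ["le", "chat", "dort"]      # sequence B (French), fully known
target = ["the", "cat"]              # sequence A (English), generated so far

# Hand-built toy vectors, not learned: each source key is a one-hot direction,
# and each target query points along the key of the word it should align with.
keys = np.eye(3)                     # (N_B, d)
queries = 4.0 * np.array([
    [1.0, 0.0, 0.0],                 # "the" looks for "le"
    [0.0, 1.0, 0.0],                 # "cat" looks for "chat"
])                                   # (N_A, d)

scores = queries @ keys.T / np.sqrt(keys.shape[1])   # (N_A, N_B) = (2, 3)
weights = softmax(scores, axis=-1)                   # one distribution over B per row of A

for word, row in zip(target, weights):
    print(word, "->", source[int(row.argmax())], np.round(row, 2))
```

Each row is a soft distribution over the source words rather than a hard choice, so the summary for "cat" is dominated by the value for "chat" but still mixes in a little of the others.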

04 · Where it lives: Two canonical homes

Encoder–decoder translation (the original transformer). Each decoder block runs causal self-attention over the target produced so far, then cross attention with Q from the target and K, V from the encoder's output, then the MLP. The decoder phrases what it needs; the frozen-for-this-step encoder supplies it. Every generated word gets a fresh, position-specific look at the entire source — not a single squashed summary vector, which was precisely the RNN bottleneck attention was invented to break.
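The order of operations in one decoder block can be sketched as follows. This is a deliberately stripped-down illustration: single head, no layer normalisation, no biases, a ReLU MLP, and random weights. The `params` dictionary layout and all sizes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_src, kv_src, w_q, w_k, w_v, mask=None):
    q, k, v = q_src @ w_q, kv_src @ w_k, kv_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)     # masked entries get zero weight
    return softmax(scores, axis=-1) @ v

def decoder_block(target, memory, p):
    n = target.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))    # position i sees positions <= i
    x = target + attention(target, target, *p["self"], mask=causal)  # causal self-attention
    x = x + attention(x, memory, *p["cross"])        # cross attention, no mask
    hidden = np.maximum(0.0, x @ p["w1"])            # position-wise MLP (ReLU)
    return x + hidden @ p["w2"]

rng = np.random.default_rng(0)
d, d_mem, n_tgt, n_src = 8, 6, 4, 7                  # illustrative sizes (assumed)
params = {
    "self": tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3)),
    "cross": (rng.normal(size=(d, d)) * 0.1,         # w_q acts on the target stream
              rng.normal(size=(d_mem, d)) * 0.1,     # w_k acts on the encoder output
              rng.normal(size=(d_mem, d)) * 0.1),    # w_v acts on the encoder output
    "w1": rng.normal(size=(d, 4 * d)) * 0.1,
    "w2": rng.normal(size=(4 * d, d)) * 0.1,
}
target = rng.normal(size=(n_tgt, d))                 # target tokens produced so far
memory = rng.normal(size=(n_src, d_mem))             # encoder output for the source

out = decoder_block(target, memory, params)
print(out.shape)                                     # (4, 8)
```

Only the self-attention call receives a mask; the cross-attention call sees every row of `memory` from every target position.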

Text-conditioned diffusion (Stable Diffusion). The U-Net denoises a latent image; at several resolutions, spatial positions of the feature map emit queries and the text encoder's output states for the prompt supply keys and values. Each region of the image independently asks the prompt what it should contain. The rectangular attention maps are interpretable enough that prompt-editing methods manipulate them directly to move or restyle objects. Multimodal LLMs (Flamingo-style) use the same pattern with vision features as B: whenever two modalities meet, this is the joint.
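The shapes involved in the image case can be sketched like this. It is a single-head illustration with random weights and assumed sizes, not the actual Stable Diffusion code: a feature map is flattened into a sequence of spatial positions, attends to the prompt's token states, and the attention weights are reshaped back into one spatial map per prompt token.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
h, w, c = 8, 8, 16                   # feature-map height, width, channels (assumed)
t, d_text, d = 5, 12, 16             # prompt tokens, text width, attention width (assumed)

feat = rng.normal(size=(h, w, c))    # stream A: image features at one resolution
text = rng.normal(size=(t, d_text))  # stream B: text-encoder states for the prompt

w_q = rng.normal(size=(c, d)) * 0.1
w_k = rng.normal(size=(d_text, d)) * 0.1
w_v = rng.normal(size=(d_text, c)) * 0.1     # values projected back to image channels

q = feat.reshape(h * w, c) @ w_q     # (H*W, d): one query per spatial position
k = text @ w_k                       # (T, d)
v = text @ w_v                       # (T, c)

weights = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (H*W, T), rectangular
out = (weights @ v).reshape(h, w, c)               # text-informed update per position

# One H x W map per prompt token: where in the image that token is attended to.
token_maps = weights.T.reshape(t, h, w)
print(out.shape, token_maps.shape)   # (8, 8, 16) (5, 8, 8)
```

The per-token maps in `token_maps` are the kind of object that attention-based prompt-editing methods inspect and modify.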

Design reading: cross attention is the architecture's expression of a one-way dependency. Put B in K/V and B shapes A without being shaped by it — the condition stays fixed, the generation bends around it. When you see a cross-attention block, read it as an arrow.

Mental Model

Attention is a query–database operation; self-attention just happens to query its own database. Cross attention splits the roles: Q from the asking stream, K/V from the consulted one.
It exists because conditioning on another sequence needs a bridge: every generated position consulting every condition position, differentiably.
Rectangular N_A × N_B scores, no causal mask, information flowing strictly B → A.
The condition's K/V are computed once and reused for the whole generation — cheap by construction.
Translation decoders and text-to-image U-Nets are the same pattern: the reader changes, the book does not.