Perceiver

A small latent array reads an arbitrarily large input

01 · First principles: What if the input is just… huge?

Self-attention is the great generalist: it makes no grid assumptions, no locality assumptions, and works on any set of tokens. So why not feed it anything — raw pixels, audio samples, point clouds? Count the tokens. A modest 224×224 image is 50,176 pixels; one second of audio is tens of thousands of samples; video multiplies both.

Self-attention over M inputs costs O(M²) per layer. At M = 50,176 that is about 2.5 billion score entries per head per layer, a non-starter. The standard escapes all smuggle domain knowledge back in: convolutions assume a grid, patching (ViT) assumes a grid, spectrograms assume audio. The Perceiver's question is stricter: can one architecture eat any modality raw, without the quadratic bill and without modality-specific preprocessing?
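The arithmetic behind that figure can be checked in a few lines. This is a back-of-the-envelope sketch; the 2-byte (fp16) score size is an assumption added here for illustration and is not stated in the text above.

```python
# Size of the full self-attention score matrix when every pixel is a token.
HEIGHT, WIDTH = 224, 224
BYTES_PER_SCORE = 2  # assumption: scores stored in fp16

m = HEIGHT * WIDTH                  # one token per pixel
score_entries = m * m               # queries x keys, per head per layer
score_bytes = score_entries * BYTES_PER_SCORE

print(f"tokens M          = {m:,}")
print(f"score entries M^2 = {score_entries:,}")
print(f"score matrix size = {score_bytes / 2**30:.2f} GiB per head per layer")
```

Under that assumption the score matrix alone comes to roughly 4.7 GiB per head per layer, before any activations or gradients are counted.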

02 · The pivot: Where exactly does the quadratic come from?

The cost of attention is (number of queries) × (number of keys). Self-attention is quadratic only because the same huge array plays both roles. Nothing in the mechanism requires that. If queries came from somewhere small, the bill would change shape:

self-attention: O(M · M) → cross-attention from N latents: O(N · M), N ≪ M

So the Perceiver introduces a latent array: N learned vectors (N ≈ 256–1024, fixed, independent of input size). The latents are the readers; the input is the book. Each latent cross-attends over all M inputs and pulls in what it needs. The read now costs O(N · M), which is linear in the input size for fixed N.
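To make the change of shape concrete, here is a minimal, dependency-free sketch of the read. It is not the published implementation: the helper names `cross_attend` and `random_array` and the sizes M = 200, N = 8, D = 16 are hypothetical, there is a single head, and the learned query/key/value projections are omitted so that only the counting of scores is on display.

```python
import math
import random

def softmax(row):
    peak = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Each query row returns a softmax-weighted average of the value rows.

    Also returns how many query-key scores were computed.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs, n_scores = [], 0
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        n_scores += len(scores)
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim)])
    return outputs, n_scores

rng = random.Random(0)

def random_array(rows, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

M, N, D = 200, 8, 16              # hypothetical sizes, kept tiny for pure Python
inputs = random_array(M, D)       # stands in for raw input elements
latents = random_array(N, D)      # learned parameters in a real model

# Perceiver-style read: queries come from the latents, keys/values from the input.
read, cross_scores = cross_attend(latents, inputs, inputs)
# Ordinary self-attention: the input plays both roles.
_, self_scores = cross_attend(inputs, inputs, inputs)

print(len(read), "latent rows after the read")
print(cross_scores, "scores for the cross-attention read (N * M)")
print(self_scores, "scores for self-attention (M * M)")
```

The output of the read has N rows regardless of M, which is what lets everything downstream run at latent size.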

03 · The architecture: Read once, think cheaply, repeat

The expensive read happens once (or a few times); the deep thinking happens entirely in the small latent space.

The second half of the trick: once information is inside the latents, all further processing is self-attention among the latents — O(N²) with N ≈ 512, which is pocket change. Depth is nearly free, so the model can be very deep where it reasons and shallow where it reads. The analogy is a committee of N journalists covering a city of M people: they interview widely (cross-attention), then deliberate among themselves (latent self-attention), and may go back out for follow-up interviews.

One asymmetry to notice. The latent bottleneck forces compression at the very first layer: whatever the N latents fail to extract is gone. Iterative re-reads exist precisely to soften this — later queries can be conditioned on what the committee already knows.
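A rough ledger shows why depth and re-reads are affordable. The sketch below only counts attention-score entries per head; the function names, the depth of 48, and the choice of 8 re-reads are hypothetical values picked for illustration, and real cost also includes projections and MLPs that are ignored here.

```python
def transformer_scores(m, depth):
    """Score entries (per head) for `depth` self-attention layers over m tokens."""
    return depth * m * m

def perceiver_scores(m, n, depth, reads=1):
    """Score entries (per head) for `reads` cross-attention reads over m inputs
    plus `depth` self-attention layers over n latents."""
    return reads * n * m + depth * n * n

M = 224 * 224   # raw pixels of a 224x224 image
N = 512         # latent count, within the range quoted above
DEPTH = 48      # hypothetical depth

dense = transformer_scores(M, DEPTH)
single_read = perceiver_scores(M, N, DEPTH, reads=1)
eight_reads = perceiver_scores(M, N, DEPTH, reads=8)

print(f"self-attention stack : {dense:,}")
print(f"latents, 1 read      : {single_read:,}")
print(f"latents, 8 reads     : {eight_reads:,}")
print(f"ratio with 1 read    : {dense / single_read:,.0f}x")
```

Each extra latent layer adds only N² entries and each extra read adds N · M, so going back out for follow-up interviews stays far cheaper than a single full-resolution self-attention layer.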

04 · Perceiver IO: Decoding by query

The original Perceiver could only emit a single pooled output (a class label). Perceiver IO closes the loop with the same trick run in reverse: to produce outputs, build an output query array — one query per desired output element, carrying its position or identity — and let those queries cross-attend to the final latents:

encode: O(N·M) → process: L × O(N²) → decode: O(O·N)

Want a dense optical-flow map? One query per pixel. Want text? One query per token slot. Input size M, output size O, and model depth L are now three independent dials, a degree of decoupling that was unusual among architectures of its time.
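The three stages can be sketched end to end to show that the output length follows the query array and nothing else. This is a shape-level illustration under loud simplifications: `attend`, `perceiver_io`, and `random_rows` are hypothetical names, random vectors stand in for both learned latents and position-carrying output queries, and the projections, residual connections, and MLPs of a real model are left out.

```python
import math
import random

def attend(queries, context):
    """Softmax attention without learned projections: each query row
    becomes a weighted average of the context rows."""
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) / total
                    for j in range(dim)])
    return out

def perceiver_io(inputs, latents, output_queries, depth):
    latents = attend(latents, inputs)          # encode:  N x M scores
    for _ in range(depth):
        latents = attend(latents, latents)     # process: N x N scores per layer
    return attend(output_queries, latents)     # decode:  O x N scores

rng = random.Random(0)

def random_rows(rows, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

M, N, D, DEPTH = 300, 8, 16, 4     # hypothetical sizes
inputs = random_rows(M, D)
latents = random_rows(N, D)

# Same inputs and latents, two different output query arrays.
dense_output = perceiver_io(inputs, latents, random_rows(50, D), depth=DEPTH)
pooled_output = perceiver_io(inputs, latents, random_rows(1, D), depth=DEPTH)

print(len(dense_output), "rows for 50 output queries")
print(len(pooled_output), "row for a single pooled query")
```

Changing the number of output queries touches only the decode step; the encode and process steps are identical in both calls.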

05 · The ledger: What it costs, where it landed

Property	Transformer (self-attn)	Perceiver IO
Cost in input size M	O(M²)	O(M) per read
Modality assumptions	needs tokenisation / patching	raw bytes/pixels + position features
Information path	any token ↔ any token, every layer	everything via N-latent bottleneck
Depth cost	O(M²) per layer	O(N²) per layer

The cost shows up in the information-path row: every interaction must pass through the N-latent bottleneck, so tasks needing fine-grained token-to-token interaction at full resolution are a poor fit, and for pure language the ordinary transformer (whose tokeniser already shrinks M) remained the better tool. The Perceiver's central pattern outlived the model itself: learned latent queries reading a big array via cross-attention now appear everywhere, including Flamingo's resampler, BLIP-2's Q-Former, and DETR-style decoders. See Cross Attention for the underlying mechanism.

Mental Model

Attention cost = queries × keys; the quadratic exists only because one array plays both roles.
Fix: a small learned latent array asks the questions — cross-attention makes the read O(N·M), linear in input.
Think deeply where it is cheap: latent self-attention at O(N²), so depth decouples from input size.
Perceiver IO decodes the same way in reverse: one query per output element.
The price is a fixed-width bottleneck — the committee can only remember N notebooks' worth.