A small latent array reads an arbitrarily large input
Self-attention is the great generalist: it makes no grid assumptions, no locality assumptions, and works on any set of tokens. So why not feed it anything — raw pixels, audio samples, point clouds? Count the tokens. A modest 224×224 image is 50,176 pixels; one second of audio is tens of thousands of samples; video multiplies both.
Self-attention over M inputs costs O(M²) per layer. At M = 50k, that is 2.5 billion score entries per head per layer (a non-starter). The standard escapes each smuggle domain knowledge back in: convolutions assume a grid, patching (ViT) assumes a grid, spectrograms assume audio. The Perceiver's question is stricter — can one architecture eat any modality raw, without the quadratic bill and without modality-specific preprocessing?
The cost of attention is (number of queries) × (number of keys). Self-attention is quadratic only because the same huge array plays both roles. Nothing in the mechanism requires that. If the queries came from somewhere small, the bill would change shape: a fixed set of N queries reading M keys produces N × M scores rather than M × M.
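To make the change of shape concrete, the short calculation below counts score-matrix entries for the 224×224 image mentioned earlier. The query-set size of 512 is an assumed, illustrative value, and `score_entries` is a hypothetical helper, not part of any library.

```python
def score_entries(num_queries: int, num_keys: int) -> int:
    """Entries in one attention score matrix (per head, per layer)."""
    return num_queries * num_keys

M = 224 * 224   # pixels in the example image: 50,176
N = 512         # assumed size of a small query set (illustrative)

self_attn = score_entries(M, M)    # the input attends to itself
cross_attn = score_entries(N, M)   # a small query set reads the input

print(self_attn)                 # 2517630976
print(cross_attn)                # 25690112
print(self_attn // cross_attn)   # 98
```

The saving is exactly M / N, so it grows with the input size as long as the query set stays fixed.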
So the Perceiver introduces a latent array: N learned vectors (N ≈ 256–1024, fixed, independent of input size). The latents are the readers; the input is the book. Each latent cross-attends over all M inputs and pulls in what it needs. Each read costs O(M·N), which is linear in the input size for fixed N.
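A minimal NumPy sketch of one such read follows. It assumes a single attention head, random matrices standing in for learned weights, no position features, and a small 64×64 RGB input so that it runs quickly; the function name `cross_attend` and all sizes are illustrative choices, not the published implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, Wq, Wk, Wv):
    """Single-head cross-attention: each latent row reads from all input rows."""
    Q = latents @ Wq                            # (N, d) queries come from the latents
    K = inputs @ Wk                             # (M, d) keys come from the input
    V = inputs @ Wv                             # (M, d) values come from the input
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (N, M): grows linearly with M
    return softmax(scores) @ V, scores.shape

rng = np.random.default_rng(0)
M, C = 64 * 64, 3        # flattened 64x64 RGB image, kept small for the demo
N, D, d = 256, 64, 32    # latent count, latent width, head width (assumed)

inputs = rng.standard_normal((M, C))
latents = rng.standard_normal((N, D))     # would be learned parameters in a real model
Wq = rng.standard_normal((D, d)) * 0.1    # random stand-ins for learned projections
Wk = rng.standard_normal((C, d)) * 0.1
Wv = rng.standard_normal((C, d)) * 0.1

read, score_shape = cross_attend(latents, inputs, Wq, Wk, Wv)
print(read.shape, score_shape)   # (256, 32) (256, 4096)
```

Doubling M doubles the width of the score matrix but leaves the number of rows, and the shape of the result, unchanged.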
The expensive read happens once (or a few times); the deep thinking happens entirely in the small latent space.
The second half of the trick: once information is inside the latents, all further processing is self-attention among the latents — O(N²) with N ≈ 512, which is pocket change. Depth is nearly free, so the model can be very deep where it reasons and shallow where it reads. The analogy is a committee of N journalists covering a city of M people: they interview widely (cross-attention), then deliberate among themselves (latent self-attention), and may go back out for follow-up interviews.
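The read-then-deliberate loop can be sketched as follows. This is a simplified illustration under stated assumptions: single-head attention, fresh random projections on every call instead of learned weights, plain residual additions, and no layer normalisation or MLP blocks. The names `attend` and `encode`, and the `read_every` parameter that schedules follow-up reads, are hypothetical.

```python
import numpy as np

def attend(q_src, kv_src, rng, d=32):
    """Single-head attention; random projections stand in for learned weights."""
    Wq = rng.standard_normal((q_src.shape[1], d)) * 0.1
    Wk = rng.standard_normal((kv_src.shape[1], d)) * 0.1
    Wv = rng.standard_normal((kv_src.shape[1], q_src.shape[1])) * 0.1
    scores = (q_src @ Wq) @ (kv_src @ Wk).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ (kv_src @ Wv), scores.size

def encode(inputs, latents, depth, read_every, rng):
    """Alternate occasional input reads with latent-only self-attention."""
    entries = 0                                     # running count of score entries
    for layer in range(depth):
        if layer % read_every == 0:
            update, n = attend(latents, inputs, rng)    # read: N x M scores
            latents = latents + update
            entries += n
        update, n = attend(latents, latents, rng)       # deliberate: N x N scores
        latents = latents + update
        entries += n
    return latents, entries

rng = np.random.default_rng(0)
M, C, N, D = 64 * 64, 3, 256, 64    # small illustrative sizes
depth, read_every = 8, 4            # two reads, eight latent layers

out, entries = encode(rng.standard_normal((M, C)),
                      rng.standard_normal((N, D)), depth, read_every, rng)
print(out.shape, entries, depth * M * M)
```

With these sizes the loop touches 2·N·M + 8·N² score entries, while eight layers of self-attention directly over the input would touch 8·M².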
The original Perceiver could only emit a single pooled output (a class label). Perceiver IO closes the loop with the same trick run in reverse: to produce outputs, build an output query array, with one query per desired output element carrying its position or identity, and let those queries cross-attend to the final latents. Each query returns one output vector, so the output has exactly as many elements as there are queries.
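A minimal sketch of this decoding step, assuming a single head, random stand-in weights, and output queries built from raw normalised (row, column) coordinates. Real models typically use richer position features; the `decode` function and all sizes here are illustrative only.

```python
import numpy as np

def decode(output_queries, latents, Wq, Wk, Wv):
    """Single-head cross-attention: each output query reads from the latents."""
    Q = output_queries @ Wq                     # (O, d)
    K = latents @ Wk                            # (N, d)
    V = latents @ Wv                            # (N, out_dim)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (O, N)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                # one output row per query

rng = np.random.default_rng(0)
N, D = 256, 64                  # final latent array (random stand-in here)
H, W = 16, 16                   # desired output grid
latents = rng.standard_normal((N, D))

# One query per output pixel, carrying its normalised (row, col) position.
rows, cols = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
output_queries = np.stack([rows.ravel(), cols.ravel()], axis=1)   # (H*W, 2)

d, out_dim = 32, 2              # out_dim = 2 mimics a per-pixel flow vector
Wq = rng.standard_normal((2, d)) * 0.1
Wk = rng.standard_normal((D, d)) * 0.1
Wv = rng.standard_normal((D, out_dim)) * 0.1

flow = decode(output_queries, latents, Wq, Wk, Wv).reshape(H, W, out_dim)
print(flow.shape)   # (16, 16, 2)
```

Changing H and W changes only the number of queries; the latents and the weights are untouched.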
Want a dense optical-flow map? One query per pixel. Want text? One query per token slot. Input size M, output size O, and model depth L are now three independent dials, whereas in a standard transformer encoder the output length and the per-layer cost are both tied to the input length.
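As a rough accounting of attention scores only, assuming a single read of the input and ignoring projections and MLPs, the three dials enter the total cost as separate terms:

$$
\text{cost} \;=\; \underbrace{\mathcal{O}(M N)}_{\text{read input}} \;+\; \underbrace{\mathcal{O}(L N^{2})}_{\text{latent layers}} \;+\; \underbrace{\mathcal{O}(O N)}_{\text{write output}}
$$

Here N is the number of latents and O is the number of output queries. No term multiplies M, O, and L together, so each can be scaled without affecting the others' contribution.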
| Property | Transformer (self-attn) | Perceiver IO |
|---|---|---|
| Cost in input size M | O(M²) | O(M·N) per read (linear in M) |
| Modality assumptions | needs tokenisation / patching | raw bytes/pixels + position features |
| Information path | any token ↔ any token, every layer | everything via N-latent bottleneck |
| Depth cost | O(M²) per layer | O(N²) per layer |
The price is the information-path row: everything must squeeze through N latents, so tasks needing fine-grained token-to-token interaction at full resolution are a poor fit, and for pure language the ordinary transformer (whose tokeniser already shrinks M) remained the better tool. The Perceiver's central idea outlived the model itself: learned latent queries reading a big array via cross-attention now appear in Flamingo's Perceiver Resampler and BLIP-2's Q-Former, and DETR's object queries, which predate the Perceiver, follow the same pattern. See Cross Attention for the underlying mechanism.