General ML

S4 / State Space Models

One linear system, two faces: recurrence for inference, convolution for training

01 · First principlesThe problem that forces SSMs to exist

Sequence modelling after transformers had an unresolved tension. RNNs infer cheaply — O(1) state per step — but train sequentially and forget over long horizons. Transformers train in parallel and recall anything, but pay O(T²) attention and carry a cache that grows with the sequence. The question: is there a model that trains like a transformer and infers like an RNN?

State space models answer by going back to a much older object — the linear dynamical system of control theory — and noticing it has exactly the two faces required.

x′(t) = A x(t) + B u(t)   ·   y(t) = C x(t)
state dynamicsinput writereadout

A hidden state x evolves continuously under A, is written to by the input u through B, and is read out through C. Discretise with a step size Δ (zero-order hold gives A̅ = exp(ΔA), with a matching B̅) and you get a sequence-to-sequence map ready for tokens.

02 · The dualityRecurrence and convolution are the same system

Because the system is linear and time-invariant, the discretised model can be computed two ways, and this duality is the defining idea of the whole field.

Face 1 · recurrence (inference)
xk = A̅xk−1 + B̅uk,  yk = Cxk. Step-by-step like an RNN: O(1) memory, constant time per token, no cache growing with context. Exactly what generation and streaming want.
Face 2 · convolution (training)
Unroll the recurrence: yk = Σj CA̅jB̅ · uk−j — a 1-D convolution of the whole input with one fixed kernel K̅ = (CB̅, CA̅B̅, CA̅²B̅, …). Compute K̅ once, convolve via FFT: O(T log T), fully parallel. Exactly what GPU training wants.
y = K̅ ∗ u,    K̅j = C A̅j B̅   —  the entire history compressed into one global kernel

Same weights, same outputs, two execution plans. Linearity is the price of admission: a tanh anywhere in the state update (as in an RNN) destroys the unrolling, and with it the convolutional face. S4's bet is that the nonlinearity can live between layers (mixing, gating) while the time axis itself stays linear.

x′ = Ax + Bu y = Cx UNROLL AS RECURRENCE x₁ x₂ x₃ O(1) memory per-token inference UNROLL AS CONVOLUTION u₁ u₂ u₃ … u_T K̅ = (CB̅, CA̅B̅, …) FFT · O(T log T) · parallel ONE LINEAR SYSTEM · TWO EXECUTION PLANS

Train through the convolutional face, deploy through the recurrent face; the parameters never know the difference.

03 · The hard partHiPPO: an A that remembers

A random A fails for the same reason RNNs fail: powers A̅j either decay to nothing or blow up, so the kernel K̅ is effectively short and the long-range memory is fictional. S4's contribution was a principled initialisation. The HiPPO matrix is derived by asking: what dynamics make the state x(t) the optimal compression of the input's history — concretely, its coefficients in a Legendre-polynomial basis, continuously updated as new input arrives?

With that A, the state is an online summary of the whole past by construction, and the kernel carries usable mass across thousands of steps (S4 was the first model to crack Path-X, at sequence length 16,384). The rest of S4 is numerical engineering — a structured (diagonal-plus-low-rank, later just diagonal in S4D/S5) parameterisation so that exp(ΔA) and the kernel are cheap and stable to compute.

04 · Failure → fixWhy language resisted, and Mamba's answer

Vanilla S4 excelled on audio and long continuous signals but lagged transformers on language. The diagnosis is structural: A̅, B̅, C are fixed, so every token is written into the state with the same dynamics, regardless of what the token is. The model is a time-invariant filter; it cannot do content-based addressing — "ignore this filler word, latch onto that name, recall it when the question arrives" — which is precisely what attention does natively and language constantly demands.

Selective SSMs (Mamba) repair this by making Δ, B, C functions of the current input. The dynamics now depend on content: the model can dilate time (large Δ writes firmly, small Δ lets the token glide past) and modulate what is written and read, token by token. The cost is exact: time-invariance is gone, so the convolutional face is gone. Mamba recovers training parallelism a different way — an associative parallel scan over the recurrence, fused into GPU-friendly kernels — keeping O(T) training and O(1) inference.

The arc in one line: S4 proved linear state dynamics can carry long memory; Mamba added input-dependent gating to give them something attention-like to say; hybrids (see Griffin) mix a little attention back in for exact recall. The three-way framing is in LLM vs RNN vs S4.

05 · PlacementWhere SSMs sit in the landscape

RNN / LSTMTransformerSSM (S4 / Mamba)
Training over length Tsequentialparallel, O(T²)parallel, O(T log T) / scan
Inference stateO(1)KV cache grows with TO(1)
Long-range memorygated, lossyexact lookupstrong but compressed
Content-based recallweaknativeonly with selectivity (Mamba)

The compressed-state column is both the selling point and the caveat: a fixed-size state cannot store an arbitrarily long context verbatim, so needle-in-haystack recall remains attention's home turf, and production systems increasingly hybridise.

Mental Model