Can state-space models match transformers at copying and retrieval?

Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.

Synthesis note · 2026-06-03 · sourced from Novel Architectures

The efficiency case for generalized state-space models (GSSMs: S4, Mamba, linear attention, parallel RNNs) is that they use an O(1) fixed-size latent state instead of the transformer's Ω(L) memory, where L is the sequence length. This paper asks what that efficiency costs, and proves a sharp limit: a two-layer transformer can copy strings of exponential length, while GSSMs are fundamentally bounded by their fixed-size state. Empirically, transformers beat GSSMs at copying and context retrieval on synthetic tasks, and pretrained transformer LLMs dramatically outperform state-space LLMs at copying and retrieving information from context.
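
A toy sketch of the memory contrast, not a model from the paper. `FullContextMemory` and `FixedStateMemory` are hypothetical stand-ins: the first keeps every key-value pair it sees (memory grows with context length), the second keeps at most a fixed number of slots. The eviction rule (drop the oldest pair) is an assumption chosen for simplicity; real GSSMs compress their state in learned ways, so this only illustrates the shape of the trade-off.

```python
from collections import OrderedDict


class FullContextMemory:
    """Hypothetical stand-in for attention: keeps every key-value pair seen."""

    def __init__(self):
        self.pairs = {}

    def write(self, key, value):
        self.pairs[key] = value

    def read(self, key):
        return self.pairs.get(key)

    def size(self):
        return len(self.pairs)


class FixedStateMemory:
    """Hypothetical stand-in for a fixed-size state: at most `slots` pairs."""

    def __init__(self, slots):
        self.slots = slots
        self.pairs = OrderedDict()

    def write(self, key, value):
        if key in self.pairs:
            self.pairs.pop(key)
        self.pairs[key] = value
        # Assumed eviction rule: drop the oldest pair once the state is full.
        while len(self.pairs) > self.slots:
            self.pairs.popitem(last=False)

    def read(self, key):
        return self.pairs.get(key)

    def size(self):
        return len(self.pairs)


def recall_accuracy(memory, context_length):
    """Write `context_length` distinct pairs, then query every key."""
    for i in range(context_length):
        memory.write(f"k{i}", f"v{i}")
    hits = sum(memory.read(f"k{i}") == f"v{i}" for i in range(context_length))
    return hits / context_length
```

With 64 pairs in context, `recall_accuracy(FullContextMemory(), 64)` is 1.0, while `recall_accuracy(FixedStateMemory(slots=16), 64)` is 0.25: the fixed-slot memory can answer only for the 16 pairs it still holds, however long the context grows.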

The keeper is the mechanism-level trade-off: a fixed-size memory cannot losslessly hold arbitrary context, so any task that requires reproducing or retrieving from the input verbatim has a hard ceiling for GSSMs that transformers don't face. This is the precise capability cost of the efficiency that makes linear-attention architectures attractive. The authors' constructive suggestion — hybrid architectures that give SSMs an attention-like retrieval mechanism — is now the dominant design response.
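
The ceiling can be illustrated with a simple counting argument. This is a sketch under stated assumptions, not the paper's proof: a state of b bits takes at most 2^b distinct values, so it cannot distinguish more than 2^b input strings. The function names and the 2-bit encoder below are hypothetical, introduced only for illustration.

```python
from itertools import product


def max_lossless_copy_length(state_bits, vocab_size):
    """Longest string length n with vocab_size**n <= 2**state_bits."""
    n = 0
    while vocab_size ** (n + 1) <= 2 ** state_bits:
        n += 1
    return n


def find_collision(encode, length, alphabet):
    """Return two distinct strings that `encode` maps to one state, or None."""
    seen = {}
    for chars in product(alphabet, repeat=length):
        s = "".join(chars)
        state = encode(s)
        if state in seen:
            return seen[state], s
        seen[state] = s
    return None


def two_bit_state(s):
    """Hypothetical 2-bit encoder: count of 'b' characters modulo 4."""
    return s.count("b") % 4
```

For example, `max_lossless_copy_length(1024, 256)` is 128: a 1024-bit state can hold at most 128 arbitrary bytes. There are 8 binary strings of length 3 but only 4 two-bit states, so any encoder must collide; `find_collision(two_bit_state, 3, "ab")` returns `("aab", "aba")`. Once two inputs share a state, no decoder can reproduce both verbatim.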

This grounds the efficiency-vs-capability tension in the vault's architecture thread. It is the cautionary counterweight to "Can spiking neurons make transformers efficient on any hardware?": linear/spiking attention buys efficiency, but this proof says the fixed state pays for it in copying and retrieval, which is why SpikingBrain and others use hybrid-linear rather than pure-linear attention.

Related concepts in this collection 2

Can spiking neurons make transformers efficient on any hardware? Explores whether brain-inspired spiking mechanisms combined with linear attention can adapt existing transformer checkpoints into efficient models trainable outside NVIDIA ecosystems using minimal additional data.
the efficiency play this proof bounds; explains why hybrid-linear beats pure-linear
Can recurrent memory scale where attention fails on ultra-long text? GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?
counterpoint on the long-context axis: recurrent memory can win where attention degrades, but at a different task profile

Original note title

transformers provably beat state-space models at copying and retrieving from context because a fixed-size latent state cannot
