Can state-space models match transformers at copying and retrieval?
Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.
The efficiency case for generalized state-space models (GSSMs: S4, Mamba, linear attention, parallel RNNs) is that they use an O(1) fixed-size latent state instead of the transformer's Ω(L) memory, where L is the sequence length. This paper asks what capability that efficiency costs, and proves a sharp limit: a two-layer transformer can copy strings of exponential length, while GSSMs are fundamentally bounded by their fixed-size state. Empirically, transformers beat GSSMs at copying and context-retrieval on synthetic tasks, and pretrained transformer LLMs dramatically outperform state-space LLMs at copying and retrieving information from context.
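The memory contrast can be made concrete with a toy retrieval task. The sketch below is illustrative only and is not taken from the paper: `FullContextReader`, `FixedStateReader`, and `retrieval_accuracy` are hypothetical names, and a sliding window stands in for any fixed-capacity state (real GSSMs compress lossily rather than truncate).

```python
from collections import deque


class FullContextReader:
    """Keeps every (key, value) pair it reads, so memory grows with input length."""

    def __init__(self):
        self.memory = []

    def read(self, key, value):
        self.memory.append((key, value))

    def retrieve(self, key):
        # Scan the stored context, most recent entry first.
        for stored_key, stored_value in reversed(self.memory):
            if stored_key == key:
                return stored_value
        return None


class FixedStateReader(FullContextReader):
    """Same interface, but the state holds at most `capacity` pairs (an assumption
    standing in for a fixed-size latent state); older pairs are dropped."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)


def retrieval_accuracy(reader, pairs):
    """Feed all pairs to the reader, then query every key once."""
    for key, value in pairs:
        reader.read(key, value)
    hits = sum(reader.retrieve(key) == value for key, value in pairs)
    return hits / len(pairs)


# A synthetic lookup table with 100 unique keys.
pairs = [(f"name{i}", f"number{i}") for i in range(100)]
full_accuracy = retrieval_accuracy(FullContextReader(), pairs)
fixed_accuracy = retrieval_accuracy(FixedStateReader(capacity=10), pairs)
print(full_accuracy)   # 1.0
print(fixed_accuracy)  # 0.1
```

The full-context reader pays for exact retrieval with memory that grows with the input. The fixed-state reader uses constant memory but can only answer queries about what still fits in its state.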
The keeper is the mechanism-level trade-off: a fixed-size memory cannot losslessly hold arbitrary context, so any task that requires reproducing or retrieving from the input verbatim has a hard ceiling for GSSMs that transformers don't face. This is the precise capability cost of the efficiency that makes linear-attention architectures attractive. The authors' constructive suggestion — hybrid architectures that give SSMs an attention-like retrieval mechanism — is now the dominant design response.
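The ceiling follows from a counting (pigeonhole) argument, sketched below under stated assumptions: the model is deterministic, its output depends on the input only through a state of `state_bits` bits, and inputs are uniformly random strings over `vocab_size` tokens. The function names are hypothetical and the bound is a generic counting bound, not a restatement of the paper's theorem.

```python
from fractions import Fraction


def max_lossless_copy_length(state_bits: int, vocab_size: int) -> int:
    """Largest length L such that all vocab_size**L strings map to distinct states."""
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    num_states = 2 ** state_bits
    length = 0
    # Grow L while every string of length L + 1 could still get its own state.
    while vocab_size ** (length + 1) <= num_states:
        length += 1
    return length


def copy_accuracy_upper_bound(state_bits: int, vocab_size: int, length: int) -> Fraction:
    """Upper bound on exact-copy accuracy for uniformly random strings.

    A deterministic decoder can emit at most one output per state, so at most
    2**state_bits of the vocab_size**length possible inputs can be copied correctly.
    """
    num_states = 2 ** state_bits
    num_strings = vocab_size ** length
    return Fraction(min(num_states, num_strings), num_strings)


print(max_lossless_copy_length(16, 4))      # 8
print(copy_accuracy_upper_bound(16, 4, 8))  # 1
print(copy_accuracy_upper_bound(16, 4, 9))  # 1/4
```

With a 16-bit state and a 4-token vocabulary, every string up to length 8 can in principle be stored; one token longer and at most a quarter of inputs can be reproduced exactly, whatever the model has learned.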
This grounds the efficiency-vs-capability tension in the vault's architecture thread. It is the cautionary counterweight to the note "Can spiking neurons make transformers efficient on any hardware?": linear/spiking attention buys efficiency, but this proof says the fixed state pays for it in copying and retrieval, which is why SpikingBrain and others use hybrid-linear rather than pure-linear attention.
Inquiring lines that use this note as a source 13
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can spline-based activations replace MLPs in transformer architectures?
- Does recurrent memory or gist compression work better for ultra-long context?
- How do memory hierarchies and compression reduce context management demands?
- Can recurrent state mechanisms process longer sequences than attention-based working memory approaches?
- Why does attending to own latents work better than bolted-on external memory stores?
- Does retrieval quality depend more on access structure or write gating?
- Can externalizing bookkeeping to a stateful harness replace internalized memory control?
- How do fixed recurrent states trade off copying accuracy for filtering ability?
- How does structured environment state compare to transcript replay for multi-turn reasoning?
- Does attention linearity alone explain the efficiency gains over standard transformers?
- How does temporal grounding in retrieval compare to architectural approaches?
- What are the concrete efficiency gains of linear-attention state-space models?
- Can fixed-size latent states losslessly store arbitrary input context?
Related concepts in this collection 2
- Can spiking neurons make transformers efficient on any hardware?
  Explores whether brain-inspired spiking mechanisms combined with linear attention can adapt existing transformer checkpoints into efficient models trainable outside NVIDIA ecosystems using minimal additional data.
  the efficiency play this proof bounds; explains why hybrid-linear beats pure-linear
- Can recurrent memory scale where attention fails on ultra-long text?
  GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?
  counterpoint on the long-context axis: recurrent memory can win where attention degrades, but at a different task profile
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Repeat After Me: Transformers are Better than State Space Models at Copying
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
- Titans: Learning to Memorize at Test Time
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
- Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
- Compositional Reasoning with Transformers, RNNs, and Chain of Thought
Original note title
transformers provably beat state-space models at copying and retrieving from context because a fixed-size latent state cannot