Reasoning and Learning Architectures

Is the long-context bottleneck really about memory or compute?

Explores whether the challenge of handling long context windows stems from storage capacity limits or from the computational cost of transforming context into internal state. Understanding this distinction reshapes how we design language models.

Note · 2026-05-28 · sourced from Novel Architectures

The standard framing of the long-context problem is capacity: attention scales poorly with context length, the KV cache grows, and we run out of room. "Language Models Need Sleep" reframes it as a compute-allocation problem. When the context window fills, the model enters a "sleep" — it performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space-model blocks through a learned local rule, then clears the KV cache and resumes. The information that would be lost on eviction is not stored verbatim; it is transformed into internal state by spending compute.
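The wake/sleep cycle described above can be sketched as a toy loop. This is a minimal illustration, not the paper's implementation: the class `ToySleeper`, the scalar `fast_weight`, and the delta-style update with step size `lr` are all hypothetical stand-ins for the SSM fast weights and the learned local rule.

```python
from dataclasses import dataclass, field


@dataclass
class ToySleeper:
    """Toy stand-in for a model with a bounded KV cache and fast weights."""

    window: int                 # max tokens held before a sleep is triggered
    sleep_passes: int           # N: offline passes over the cache per sleep
    lr: float = 0.5             # step size of the hypothetical local rule
    cache: list = field(default_factory=list)
    fast_weight: float = 0.0    # one scalar standing in for the fast weights
    sleeps: int = 0

    def observe(self, token: float) -> None:
        """Wake step: append to the cache and sleep once the window is full."""
        self.cache.append(token)
        if len(self.cache) >= self.window:
            self.sleep()

    def sleep(self) -> None:
        """Consolidate the cache into the fast weight, then clear the cache."""
        for _ in range(self.sleep_passes):
            for token in self.cache:
                # Assumed local rule: nudge the weight toward each token.
                self.fast_weight += self.lr * (token - self.fast_weight)
        self.cache.clear()
        self.sleeps += 1


model = ToySleeper(window=4, sleep_passes=2)
for token in [1.0, 1.0, 1.0, 1.0]:
    model.observe(token)
print(len(model.cache), model.sleeps, model.fast_weight)
```

After the fourth token the cache is empty, yet the fast weight has moved toward the observed value: the context survives eviction as state, paid for with `sleep_passes` passes of compute rather than with cache space.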

This relocates the bottleneck. The question is not "how much can we hold?" but "how much compute do we spend converting recent context into persistent weights, and when?" The design shifts that compute to the sleep phase, so wake-time prediction latency is preserved. The reported empirical signature supports the compute reading: increasing the sleep duration N improves performance, with the largest gains on examples that require deeper reasoning. More offline compute buys more capability on hard cases, which is the test-time-scaling pattern moved to an offline window.
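To make the "more passes, same state size" idea concrete, here is a small sketch under strong simplifying assumptions. A single scalar weight is fit to (input, target) pairs with a plain delta rule, and recall error is measured after N passes. The functions `consolidate` and `recall_error`, the data, and the step size are illustrative choices, not taken from the paper, and the toy says nothing about reasoning depth; it only shows that extra offline passes can improve a fixed-size state.

```python
def consolidate(context, passes, lr=0.05):
    """Run `passes` offline sweeps of a delta rule over the context.

    The state is one scalar `w`; its size never grows with the context.
    """
    w = 0.0
    for _ in range(passes):
        for x, y in context:
            # Delta rule: correct w in proportion to the prediction error.
            w += lr * (y - w * x) * x
    return w


def recall_error(w, context):
    """Mean squared error when recalling targets from the consolidated state."""
    return sum((y - w * x) ** 2 for x, y in context) / len(context)


# Toy context in which every target is 3 times its input.
context = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
errors = {n: recall_error(consolidate(context, n), context) for n in (1, 2, 4, 8)}
for n, err in errors.items():
    print(f"N={n}: recall error {err:.6f}")
```

In this toy the error falls as N grows while the cost of a wake-time prediction (`w * x`) stays constant, which mirrors the trade the note describes: the extra work is paid during sleep, not at prediction time.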

The reframe is significant because it dissolves the capacity ceiling rather than raising it. A capacity solution adds memory; a compute solution adds passes. This connects to the vault's emerging theme that when a model thinks is as much a design choice as how much it thinks. Per "When should AI systems do their thinking?", shifting inference to idle windows is a third temporal position for compute, and the sleep-consolidation mechanism is its architectural realization inside the weights. It also relates to alternatives that attack the capacity framing differently: per "Can neural memory modules scale language models beyond attention limits?", one can add a long-term memory module instead of consolidating into fast weights.

Counterpoint: spending compute on consolidation is only a win if the offline budget is genuinely free; under continuous load with no idle time, the sleep cost competes with serving.

Why it matters: it tells architects to budget consolidation compute rather than chase ever-larger context windows.
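The counterpoint can be phrased as a simple budgeting check. The sketch below assumes the idle time per window and the cost of one consolidation pass can both be measured; the function name `affordable_sleep_passes` and its parameters are hypothetical and not part of any described system.

```python
def affordable_sleep_passes(idle_seconds, seconds_per_pass, max_passes):
    """Return how many sleep passes fit into idle time without touching serving.

    idle_seconds:     assumed measured idle time available before the next request
    seconds_per_pass: assumed measured cost of one consolidation pass
    max_passes:       cap on N, beyond which extra passes are not wanted
    """
    if seconds_per_pass <= 0:
        raise ValueError("seconds_per_pass must be positive")
    if idle_seconds <= 0:
        return 0  # continuous load: any sleep would compete with serving
    return min(max_passes, int(idle_seconds // seconds_per_pass))


print(affordable_sleep_passes(10.0, 2.5, 8))   # idle window partly covers the cap
print(affordable_sleep_passes(0.0, 2.5, 8))    # no idle time at all
print(affordable_sleep_passes(100.0, 2.5, 8))  # idle time exceeds the cap
```

When the function returns 0, consolidation is no longer free and has to be traded against serving latency, which is the case the counterpoint warns about.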


— "Language Models Need Sleep", https://arxiv.org/abs/2605.26099

Original note title

the long-context bottleneck is compute to transform evicted context into internal state not memory capacity