Is the long-context bottleneck really about memory or compute?
Explores whether the challenge of handling long context windows stems from storage capacity limits or from the computational cost of transforming context into internal state. Understanding this distinction reshapes how we design language models.
The standard framing of the long-context problem is capacity: attention scales poorly with context length, the KV cache grows, and we run out of room. "Language Models Need Sleep" reframes it as a compute-allocation problem. When the context window fills, the model enters a "sleep" — it performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space-model blocks through a learned local rule, then clears the KV cache and resumes. The information that would be lost on eviction is not stored verbatim; it is transformed into internal state by spending compute.
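The wake/sleep cycle described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's method: the class `SleepyMemory`, its parameters, and the hand-written delta rule are all hypothetical stand-ins (the paper describes a learned local rule inside state-space-model blocks, which is not reproduced here). The sketch only shows the control flow: fill the cache, run N offline passes that fold the cached pairs into fast weights, then clear the cache.

```python
from dataclasses import dataclass, field


def matvec(w, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


@dataclass
class SleepyMemory:
    """Toy wake/sleep loop. All names and defaults are hypothetical."""
    dim: int
    capacity: int        # assumed KV-cache limit, counted in (key, value) pairs
    sleep_passes: int    # N: offline passes per sleep
    lr: float = 0.5      # assumed step size for the stand-in update rule
    cache: list = field(default_factory=list)
    fast_weights: list = field(init=False)
    sleeps: int = 0

    def __post_init__(self):
        self.fast_weights = [[0.0] * self.dim for _ in range(self.dim)]

    def observe(self, key, value):
        """Wake phase: append to the cache; sleep once it is full."""
        self.cache.append((key, value))
        if len(self.cache) >= self.capacity:
            self.sleep()

    def sleep(self):
        """Run N passes over the cache, then evict everything.

        The delta rule below is a hand-written substitute for the
        learned local rule; it is used only to make the loop concrete.
        """
        for _ in range(self.sleep_passes):
            for key, value in self.cache:
                pred = matvec(self.fast_weights, key)
                err = [v - p for v, p in zip(value, pred)]
                for i in range(self.dim):
                    for j in range(self.dim):
                        self.fast_weights[i][j] += self.lr * err[i] * key[j]
        self.cache.clear()
        self.sleeps += 1

    def recall(self, key):
        """Read consolidated content back out of the fast weights."""
        return matvec(self.fast_weights, key)


def recall_error(n_passes):
    """Absolute recall error for one evicted pair after a sleep of N passes."""
    mem = SleepyMemory(dim=2, capacity=2, sleep_passes=n_passes)
    mem.observe([1.0, 0.0], [2.0, -1.0])
    mem.observe([0.0, 1.0], [0.5, 3.0])   # cache is full here, so sleep runs
    out = mem.recall([1.0, 0.0])
    return sum(abs(a - b) for a, b in zip(out, [2.0, -1.0]))
```

In this toy setting, `recall_error(8)` is smaller than `recall_error(1)`: the cache is empty in both cases, and the only thing that changed is how many offline passes were spent before eviction. That is the sense in which the limiting resource is compute rather than storage.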
This relocates the bottleneck. The question is not "how much can we hold?" but "how much compute do we spend converting recent context into persistent weights, and when?" The design shifts that compute to the sleep phase, preserving wake-time prediction latency. The empirical signature confirms it is a compute story: increasing sleep duration N improves performance, with the largest gains on examples that require deeper reasoning — more offline compute buys more capability on hard cases, exactly the test-time-scaling pattern moved to an offline window.
The reframe is significant because it dissolves the capacity ceiling rather than raising it. A capacity solution adds memory; a compute solution adds passes. This connects to the vault's emerging theme that when a model thinks is as designable as how much it thinks: as "When should AI systems do their thinking?" argues, shifting inference to idle windows is a third temporal position for compute, and the sleep-consolidation mechanism is its architectural realization inside the weights. It also relates to alternatives that attack the capacity framing differently: as "Can neural memory modules scale language models beyond attention limits?" describes, one can add a long-term memory module instead of consolidating into fast weights. Counterpoint: spending compute on consolidation is only a win if the offline budget is genuinely free; under continuous load with no idle time, the sleep cost competes with serving. Why it matters: it tells architects to budget consolidation compute rather than chase ever-larger context windows.
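The counterpoint can be made concrete with a back-of-the-envelope budget. The sketch below is an assumption-laden illustration, not something taken from the paper: the function names, the idea of a fixed per-pass cost, and the idle-window accounting are all hypothetical. It asks how many sleep passes fit into the idle time available, and how much serving time a given N would take away when they do not fit.

```python
def affordable_sleep_passes(idle_seconds, seconds_per_pass):
    """Largest N whose total cost fits inside the idle window.

    Both arguments are hypothetical measurements an operator would
    have to supply; a constant per-pass cost is assumed.
    """
    if seconds_per_pass <= 0:
        raise ValueError("seconds_per_pass must be positive")
    return int(max(0.0, idle_seconds) // seconds_per_pass)


def serving_time_stolen(n_passes, seconds_per_pass, idle_seconds):
    """Seconds of sleep compute that spill over into serving time."""
    return max(0.0, n_passes * seconds_per_pass - idle_seconds)
```

With 10 seconds of idle time and 2.5 seconds per pass, four passes are free. With no idle time at all, the same four passes cost 10 seconds of serving capacity, which is the case the counterpoint warns about.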
— "Language Models Need Sleep", https://arxiv.org/abs/2605.26099
Related concepts in this collection
- When should AI systems do their thinking?
  Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
  the general principle of shifting inference to idle windows; sleep-consolidation realizes it inside the weights
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  an alternative that adds long-term memory capacity rather than consolidating into fast weights
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  another non-capacity approach: prune the KV cache rather than consolidate it into weights
Original note title
the long-context bottleneck is compute to transform evicted context into internal state not memory capacity