Language Models Need Sleep

Paper · arXiv 2605.26099
Novel LLM Architectures · LLM Memory · LLM Architecture · Context Engineering

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value (KV) cache. During sleep, the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. At inference time, this shifts the extra computation into the sleep phase while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which both a regular transformer and SSM-attention hybrid models fail. We then show that increasing the sleep duration N improves our models' performance, with the largest gains on examples that require deeper reasoning.

However, scalable memory is not the same as scalable reasoning. A fast weight memory may support long-range recall, but it is unclear whether it can support deep computation over tokens that are no longer present in the KV cache. We find that, under the same token budget, the performance of vanilla SSM-attention hybrid models degrades as the required reasoning depth increases, even when the amount of information to store is held fixed. This suggests that the bottleneck is not merely memory capacity, as prior work suggests, but the amount of computation available for transforming evicted context into a useful internal state.

In animals, the transfer from short-term memory to long-term memory is thought to be supported by hippocampal replay, especially during sleep; in this phase, short-term hippocampal memories are reactivated and consolidated into cortical synaptic weights. Inspired by these biological processes, we propose a method for transferring context-window memory into persistent weights. When the model's context window becomes full during inference, the model enters a "sleep" in which it performs multiple forward passes over the accumulated context and recursively updates its fast weights via a learned local rule. As in animal sleep, the model receives no external input tokens during this phase. After consolidation, the context window is cleared, and the model resumes operation with updated fast weights. Our key insight is that recurrence can be used not only for prediction but also for memory consolidation.
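The wake/sleep cycle described above can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the class name `SleepyModel`, the `context_limit` and `sleep_passes` parameters, and especially the update rule, which is a hand-written delta rule standing in for the learned local rule. The "context" is a plain list of (key, value) vector pairs rather than a real KV cache.

```python
class SleepyModel:
    """Toy stand-in for a model with a bounded context and fast weights.

    Hypothetical illustration only: the real method uses SSM blocks and a
    learned local rule; here a fixed delta rule plays that role.
    """

    def __init__(self, dim, context_limit, sleep_passes, lr=0.5):
        self.dim = dim
        self.context_limit = context_limit   # items held before sleep triggers
        self.sleep_passes = sleep_passes     # N: offline passes per sleep
        self.lr = lr                         # step size of the placeholder rule
        self.fast_weights = [[0.0] * dim for _ in range(dim)]
        self.context = []                    # stands in for the KV cache
        self.sleeps = 0

    def recall(self, key):
        """Read from fast weights: returns W @ key."""
        return [sum(self.fast_weights[i][j] * key[j] for j in range(self.dim))
                for i in range(self.dim)]

    def _local_update(self, key, value):
        # Placeholder delta rule: W += lr * (value - W @ key) key^T.
        # The error term depends on the current W, so each pass refines
        # the result of the previous one.
        pred = self.recall(key)
        for i in range(self.dim):
            for j in range(self.dim):
                self.fast_weights[i][j] += self.lr * (value[i] - pred[i]) * key[j]

    def sleep(self):
        # N offline passes over the accumulated context; no new input arrives.
        for _ in range(self.sleep_passes):
            for key, value in self.context:
                self._local_update(key, value)
        self.context.clear()                 # evict the consolidated context
        self.sleeps += 1

    def observe(self, key, value):
        # Wake phase: store the item, then sleep once the context is full.
        self.context.append((key, value))
        if len(self.context) >= self.context_limit:
            self.sleep()


def consolidate(sleep_passes):
    """Feed two associations, triggering one sleep; return the model."""
    model = SleepyModel(dim=2, context_limit=2, sleep_passes=sleep_passes)
    model.observe([1.0, 0.0], [0.0, 1.0])
    model.observe([0.0, 1.0], [1.0, 0.0])
    return model


short_sleep = consolidate(sleep_passes=1)
long_sleep = consolidate(sleep_passes=3)
print(short_sleep.recall([1.0, 0.0]), long_sleep.recall([1.0, 0.0]))
```

In this toy setting, recall from the fast weights moves closer to the stored value as `sleep_passes` grows, while the wake-time `observe` call does the same amount of work per item regardless of N. This only mirrors the structure of the proposed mechanism; it says nothing about how the learned rule behaves on real tasks.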

We propose a sleep-like process in which a model performs multiple recursive forward passes to iteratively refine its fast weights before evicting the corresponding context from the attention cache. Unlike vanilla SSM-attention hybrid models, models that sleep can reason deeply about past context that they can no longer attend to. Across controlled synthetic tasks and a more realistic mathematical reasoning benchmark, we show that increasing the number of recursions, or sleep duration, improves the model's ability to perform deep sequential computation over evicted context.