Can models precompute answers before users ask questions?
Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
The standard model of test-time compute treats each query as stateless — context and query arrive together, model thinks, response is generated. But most real LLM applications are stateful: a coding agent operates on a persistent repository, a document QA system uses the same documents across many questions, a conversational assistant maintains an ongoing history.
Sleep-time compute exploits this statefulness. Between interactions, when the model would otherwise be idle, it can precompute inferences about the context: anticipated questions, architectural patterns in code, likely debugging paths. At query time, these precomputed inferences are provided alongside the prompt, letting the model respond with far lower latency while preserving the accuracy that heavier query-time compute would otherwise buy.
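To make the mechanism concrete, here is a minimal sketch of the two phases, assuming a generic `call_llm` chat-completion wrapper; the function names, prompt wording, and shape of the precomputed notes are illustrative assumptions rather than an implementation from any particular paper.

```python
# Minimal sketch of the sleep-time / query-time split.
# `call_llm` stands in for any chat-completion client; the prompts and the
# structure of the precomputed notes are illustrative assumptions.

def call_llm(prompt: str, max_tokens: int = 1024) -> str:
    """Placeholder for a real model call (wire up your provider's client here)."""
    raise NotImplementedError

def sleep_time_pass(context: str) -> str:
    """Run while the system is idle: reason over the shared context once,
    producing notes the model would otherwise have to derive at query time."""
    prompt = (
        "You are preparing for future user questions about this context.\n"
        f"CONTEXT:\n{context}\n\n"
        "Write down: likely questions, key entities and relationships, and any "
        "intermediate conclusions that would make future answers faster."
    )
    return call_llm(prompt, max_tokens=2048)

def query_time_pass(context: str, precomputed: str, question: str) -> str:
    """Run when the user actually asks: answer with the precomputed notes in
    the prompt, so much less fresh reasoning is needed per query."""
    prompt = (
        f"CONTEXT:\n{context}\n\n"
        f"PRECOMPUTED NOTES (from idle time):\n{precomputed}\n\n"
        f"QUESTION: {question}\n"
        "Answer concisely, reusing the notes where they apply."
    )
    return call_llm(prompt, max_tokens=512)
```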
The economic logic is amortization. If multiple queries share the same context, any sleep-time compute applied to that context is amortized across all those queries. The per-query cost drops even as total accuracy is preserved.
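A back-of-the-envelope calculation shows the amortization effect; all token counts below are invented for illustration.

```python
# Hypothetical token budgets; the point is the shape of the arithmetic,
# not the specific numbers.
sleep_tokens = 5_000         # one-time sleep-time reasoning over the shared context
heavy_query_tokens = 2_000   # per-query cost with full reasoning at query time
light_query_tokens = 300     # per-query cost when precomputed notes are reused
n_queries = 20               # queries that share the same context

baseline = n_queries * heavy_query_tokens                    # 40,000 tokens
with_sleep = sleep_tokens + n_queries * light_query_tokens   # 11,000 tokens
print(f"per-query cost drops from {baseline / n_queries:.0f} "
      f"to {with_sleep / n_queries:.0f} tokens")
```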
This reframes the design question: instead of "how much compute should the model use when answering?", the question becomes "when should compute happen?" — and the answer is often before the user asks, not during. See the writing angle When should AI systems do their thinking?.
Think-in-Memory as conversational sleep-time compute (2311.08719): TiM applies the sleep-time principle to conversational memory. After generating a response, the agent post-thinks — integrating historical and new thoughts to update an evolved memory using insert/forget/merge operations. Future queries retrieve pre-reasoned thoughts rather than re-deriving them from raw history. This eliminates inconsistent reasoning paths (different conclusions from the same evidence recalled for different questions) by ensuring reasoning about history happens once and persists. The memory evolves through explicit operations rather than accumulating raw context, making it a concrete implementation of sleep-time compute for the multi-turn conversation use case. See Can storing evolved thoughts prevent inconsistent reasoning in conversations?.
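As a rough illustration of the post-think loop, the sketch below implements insert/forget/merge over a list of thoughts; the `ThoughtMemory` class, the `post_think` function, and the prompt text are assumptions for exposition, not the TiM authors' code.

```python
# Simplified sketch of a TiM-style evolved memory (arXiv:2311.08719).
# The insert/forget/merge operations mirror the paper's described updates;
# the data structures and prompt wording here are illustrative only.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ThoughtMemory:
    thoughts: List[str] = field(default_factory=list)

    def insert(self, thought: str) -> None:
        """Add a new inference distilled from the latest exchange."""
        self.thoughts.append(thought)

    def forget(self, is_stale: Callable[[str], bool]) -> None:
        """Drop thoughts that are now contradicted or irrelevant."""
        self.thoughts = [t for t in self.thoughts if not is_stale(t)]

    def merge(self, indices: List[int], merged: str) -> None:
        """Collapse several overlapping thoughts into one consolidated thought."""
        keep = [t for i, t in enumerate(self.thoughts) if i not in set(indices)]
        self.thoughts = keep + [merged]

def post_think(memory: ThoughtMemory, user_turn: str, assistant_turn: str,
               llm: Callable[[str], str]) -> None:
    """After responding, reason over the new exchange once and persist the result,
    so later queries retrieve this thought instead of re-deriving it from raw history."""
    new_thought = llm(
        "State, as a standalone inference, what this exchange establishes:\n"
        f"USER: {user_turn}\nASSISTANT: {assistant_turn}"
    )
    memory.insert(new_thought)
    # A fuller implementation would also decide here whether existing thoughts
    # should be forgotten or merged in light of the new inference.
```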
Source: Test Time Compute; enriched from Memory
Related concepts in this collection
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
  Connection: a complementary rethinking of how to allocate compute.
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  Connection: both show deployment context (statefulness / training regime) matters as much as raw compute.
- How should we categorize different test-time scaling approaches?
  Test-time scaling research spans multiple strategies for improving model performance at inference. Understanding how these approaches differ—and how they relate—helps researchers and practitioners choose the right method for their constraints.
  Connection: sleep-time compute is a third category: pre-interaction TTS that fits neither internal nor external.
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  Connection: Titans' persistent memory architecture is a natural implementation substrate for sleep-time compute: the adaptive memory can store precomputed inferences that persist across interactions, and its surprise-based update mechanism naturally prioritizes novel precomputed insights.
- Can decoding-time tuning preserve knowledge better than weight fine-tuning?
  Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
  Connection: complementary inference-time adaptation: proxy-tuning applies domain adaptation at decoding time without weight modification, sleep-time compute applies reasoning pre-computation between interactions; both demonstrate that significant model behavior changes can be achieved without retraining.
- Can models treat long prompts as external code environments?
  Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
  Connection: complementary reframing: sleep-time compute separates WHEN to process context (before vs. during query), while RLMs separate WHERE context lives (external environment vs. context window); both reject the default of stuffing everything into the window at query time.
- Can long-context models resolve retriever-reader imbalance?
  Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
  Connection: parallel rebalancing of the retrieval pipeline: LongRAG shifts work from retriever to reader within a single query; sleep-time compute shifts work from query time to pre-query time; both challenge the assumption that query-time retrieval is where intelligence must concentrate.
Original note title: sleep-time compute reduces test-time latency by precomputing over stateful context