LLM Reasoning and Architecture

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use the idle time between queries to precompute useful inferences over that persistent context, reducing latency when users actually ask?

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The standard model of test-time compute treats each query as stateless — context and query arrive together, model thinks, response is generated. But most real LLM applications are stateful: a coding agent operates on a persistent repository, a document QA system uses the same documents across many questions, a conversational assistant maintains an ongoing history.

Sleep-time compute exploits this statefulness. Between interactions, when the model would otherwise be idle, it can precompute inferences about the context: anticipated questions, architectural patterns in code, likely debugging paths. At query time these precomputed inferences are supplied alongside the prompt, letting the model respond with far lower latency while matching the accuracy of heavier test-time compute.
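
To make the mechanism concrete, here is a minimal Python sketch of the two phases. The `call_llm` function, the prompts, and the data shapes are illustrative assumptions, not details from the source:

```python
# Minimal sketch of the sleep-time compute loop. `call_llm` is a hypothetical
# stand-in for whatever model API the application uses.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    raise NotImplementedError


@dataclass
class StatefulSession:
    context: str  # persistent state: a repo, a document set, a chat history
    learned: list[str] = field(default_factory=list)  # precomputed inferences


def sleep_time_step(session: StatefulSession) -> None:
    """Run while the user is idle: reason about the context ahead of any query."""
    prompt = (
        f"Context:\n{session.context}\n\n"
        "Infer facts, patterns, and likely questions a user might ask about "
        "this context. List one inference per line."
    )
    session.learned.extend(call_llm(prompt).splitlines())


def answer(session: StatefulSession, query: str) -> str:
    """Run at query time: precomputed inferences ride along with the prompt,
    so the model needs little fresh reasoning to respond."""
    notes = "\n".join(session.learned)
    prompt = (
        f"Context:\n{session.context}\n\n"
        f"Precomputed notes:\n{notes}\n\n"
        f"Question: {query}\nAnswer concisely."
    )
    return call_llm(prompt)
```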

The economic logic is amortization: when multiple queries share the same context, any sleep-time compute applied to that context is spread across all of them. Per-query cost drops while accuracy is preserved.
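
A back-of-the-envelope illustration of the amortization, with made-up numbers rather than figures from the source:

```python
# Hypothetical cost model: sleep-time tokens are shared by every query that
# reuses the same context, so their cost divides by the query count.
def per_query_cost(sleep_tokens: int, query_tokens: int, n_queries: int) -> float:
    return query_tokens + sleep_tokens / n_queries

# 10k tokens of sleep-time reasoning shared by 20 queries adds only 500
# tokens per query, on top of 1k tokens of lightweight query-time work.
print(per_query_cost(sleep_tokens=10_000, query_tokens=1_000, n_queries=20))  # 1500.0
```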

This reframes the design question: instead of "how much compute should the model use when answering?", the question becomes "when should compute happen?" — and the answer is often before the user asks, not during. See the writing angle When should AI systems do their thinking?.

Think-in-Memory as conversational sleep-time compute (2311.08719): TiM applies the sleep-time principle to conversational memory. After generating a response, the agent post-thinks, integrating historical and new thoughts to update an evolved memory via insert/forget/merge operations. Future queries retrieve these pre-reasoned thoughts rather than re-deriving them from raw history. This eliminates inconsistent reasoning paths (reaching different conclusions from the same evidence when it is recalled for different questions) by ensuring that reasoning about history happens once and persists. The memory evolves through explicit operations rather than accumulating raw context, making TiM a concrete implementation of sleep-time compute for the multi-turn conversation setting. See Can storing evolved thoughts prevent inconsistent reasoning in conversations?.
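
A toy sketch of the TiM loop: the insert/forget/merge operation names come from the paper, but the data structures, the `call_llm` placeholder, and the post-think prompt are guesses for illustration:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    raise NotImplementedError


class EvolvedMemory:
    """Stores pre-reasoned thoughts instead of raw conversation history."""

    def __init__(self) -> None:
        self.thoughts: list[str] = []

    def insert(self, thought: str) -> None:
        self.thoughts.append(thought)

    def forget(self, thought: str) -> None:
        if thought in self.thoughts:
            self.thoughts.remove(thought)

    def merge(self, old: str, new: str, merged: str) -> None:
        # Replace two overlapping thoughts with one consolidated thought.
        self.forget(old)
        self.forget(new)
        self.insert(merged)


def post_think(memory: EvolvedMemory, user_turn: str, reply: str) -> None:
    """After responding, distill the exchange into thoughts and fold them into
    memory. Future queries retrieve from memory.thoughts rather than
    re-reasoning over the raw transcript, so history is reasoned about once."""
    prompt = (
        f"User: {user_turn}\nAssistant: {reply}\n\n"
        "State the new conclusions from this exchange, one per line."
    )
    for thought in call_llm(prompt).splitlines():
        memory.insert(thought)
```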


Source: Test Time Compute; enriched from Memory

Original note: sleep-time compute reduces test-time latency by precomputing over stateful context