When should AI systems do their thinking?
Most AI inference happens when users ask questions, but what if models could think during idle time instead? This note explores whether shifting inference to the time before queries arrive could fundamentally change system design.
The entire test-time scaling literature implicitly assumes inference happens when a query arrives. Sleep-time compute challenges this temporal assumption: in stateful applications, the model can "think" between interactions — precomputing inferences about persistent context that will be useful when queries arrive.
This is a temporal reframing, not just an efficiency trick. It makes a conceptual distinction between:
- Context (stable background information — a codebase, a document, a conversation history)
- Queries (ephemeral questions about that context)
Current test-time compute bundles context processing and query answering into the same inference call, forcing all thinking to happen at query time. Sleep-time compute separates them: process context when convenient, answer queries when required.
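A minimal sketch of that separation, assuming a generic text-in/text-out `llm` callable; the class and method names (`StatefulApp`, `sleep_time_step`) are invented to illustrate the two-phase pattern, not the source's implementation:

```python
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]  # any text-in, text-out model call


@dataclass
class StatefulApp:
    context: str           # stable background: a codebase, document, or conversation history
    sleep_notes: str = ""  # inferences precomputed while the system is idle

    def sleep_time_step(self, llm: LLM) -> None:
        """Between interactions: think about the context before any query exists."""
        prompt = (
            "Study this context and write down inferences, summaries, and open "
            "questions likely to be useful for future queries:\n\n" + self.context
        )
        self.sleep_notes = llm(prompt)  # the expensive reasoning, off the query path

    def answer(self, llm: LLM, query: str) -> str:
        """At query time: a cheap call that reuses the precomputed notes."""
        prompt = (
            f"Precomputed notes about the context:\n{self.sleep_notes}\n\n"
            f"Question: {query}\nAnswer using the notes."
        )
        return llm(prompt)
```

Query latency then depends only on the second call; the first can run whenever the system is idle, and its output is reused by every later query over the same context.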
The implications cascade: latency drops (the expensive thinking has already happened), cost amortizes across the multiple queries that share the same context, and the model can invest more sophisticated reasoning in context processing than would be economically feasible at query time.
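As a rough amortization picture (invented numbers, not figures from the source): per-query cost is roughly the query-time cost plus the sleep-time cost divided by the number of queries sharing the context. If pre-processing a shared context takes the equivalent of 100k reasoning tokens and serves 50 queries, that adds only 2k amortized tokens per query, while each query-time call itself gets cheaper because the hard thinking is already written down.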
The deeper reframe: "thinking" is not a response to queries. It's a process that happens on a different timescale. Designing AI systems around this distinction could change inference architecture fundamentally.
Source: Test Time Compute
Related concepts in this collection
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  Relation: the implementation of this reframing.
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  Relation: a complementary rethinking of *how* to allocate compute.
- How should we categorize different test-time scaling approaches?
  Test-time scaling research spans multiple strategies for improving model performance at inference. Understanding how these approaches differ, and how they relate, helps researchers and practitioners choose the right method for their constraints.
  Relation: sleep-time compute is neither internal nor external; it fractures the dichotomy by shifting inference to a third, temporal position.
- Can models treat long prompts as external code environments?
  Can language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
  Relation: a parallel spatial reframing. Sleep-time compute asks WHEN to compute; RLMs ask WHERE to keep data. Together they define two independent axes for rethinking inference architecture beyond the "everything in the context window at query time" default.
- Can storing evolved thoughts prevent inconsistent reasoning in conversations?
  When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?
  Relation: a concrete instantiation in conversational systems. TiM post-thinks between turns, exactly the temporal reframing this note proposes: reasoning happens after responses (not at query time) and persists as evolved memory.
Original note title: sleep-time compute reframes when AI thinks not how much it thinks