LLM Reasoning and Architecture

When should AI systems do their thinking?

Most AI inference happens when users ask questions, but what if models could think during idle time instead? This note explores whether shifting inference to before queries arrive could fundamentally change system design.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The entire test-time scaling literature implicitly assumes inference happens when a query arrives. Sleep-time compute challenges this temporal assumption: in stateful applications, the model can "think" between interactions — precomputing inferences about persistent context that will be useful when queries arrive.

This is a temporal reframing, not just an efficiency trick. It rests on a conceptual distinction between two operations that are usually fused: processing the context and answering the query.

Current test-time compute bundles context processing and query answering into the same inference call, forcing all thinking to happen at query time. Sleep-time compute separates them: process context when convenient, answer queries when required.
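
To make the split concrete, here is a minimal Python sketch of the two phases. It assumes a generic text-in, text-out `llm(prompt)` callable; the function names and prompts are illustrative assumptions, not an implementation described in the source.

```python
# Minimal sketch of the two-phase split. `llm` stands in for any
# text-in, text-out model call; names and prompts are illustrative.

def sleep_time_pass(raw_context: str, llm) -> str:
    """Offline phase: run during idle time over the persistent context."""
    prompt = (
        "Read the following context and write out the key facts, "
        "relationships, and likely-useful inferences:\n\n" + raw_context
    )
    return llm(prompt)  # enriched, pre-digested representation of the context


def answer_query(query: str, enriched_context: str, llm) -> str:
    """Online phase: cheap at query time because the heavy thinking is pre-done."""
    prompt = (
        "Pre-processed context:\n" + enriched_context
        + "\n\nQuestion: " + query + "\nAnswer:"
    )
    return llm(prompt)
```

A scheduler would call `sleep_time_pass` whenever the context changes or the system goes idle, cache the result, and then serve any number of queries against it with `answer_query`.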

The implications cascade: latency drops (the expensive thinking is pre-done), cost amortizes across multiple queries sharing the same context, and the model can invest more sophisticated reasoning in context processing than would be economically feasible at query time.
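
A back-of-envelope token count shows the amortization. All numbers below are illustrative assumptions, not measurements from the source.

```python
# Rough amortization comparison (numbers are illustrative assumptions).
# Standard test-time compute re-processes the context on every query;
# sleep-time compute pays the context cost once and shares it across queries.

context_tokens = 50_000    # persistent context read per call
thinking_tokens = 5_000    # heavy reasoning spent on that context
query_tokens = 500         # per-query answering cost
n_queries = 20             # queries sharing the same context

standard = n_queries * (context_tokens + thinking_tokens + query_tokens)
sleep_time = (context_tokens + thinking_tokens) + n_queries * query_tokens

print(f"standard test-time compute: {standard:,} tokens")   # 1,110,000
print(f"sleep-time compute:         {sleep_time:,} tokens")  # 65,000
```

Under these assumptions the sleep-time variant spends roughly 1/17 of the tokens, and each query's latency no longer includes the context-processing pass at all.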

The deeper reframe: "thinking" need not be a response to queries. It can be a process that runs on its own timescale, decoupled from query arrival. Designing AI systems around this distinction could change inference architecture fundamentally.


Source: Test Time Compute

