Sleep-time Compute: Beyond Inference Scaling at Test-time

Paper · arXiv 2504.13171 · Published April 17, 2025
Tags: Test Time Compute · Question Answer Search · LLM Architecture

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to “think” offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time.

These drawbacks stem in part from the fact that the current approach to applying test-time compute treats problems as stateless, i.e., queries (user requests at test-time) and the contexts (background information) required to answer them are provided to the model together at “test-time.” In practice, this means that if multiple related queries require similar inferences about the same context, the model must redo those inferences each time, incurring additional latency and cost. In reality, many LLM applications are inherently stateful and operate over persisted, reused context. A classic example is document question-answering, where documents contextualize responses to questions. Likewise, coding agents operate on a large shared repository across multiple rounds of debugging support, and conversational assistants must maintain the history of past dialogue.

In stateful settings, the model can make useful inferences about the current state (the context) offline, before or even during the user’s next input. We refer to this process as sleep-time compute: inference performed between interactions, while the model would otherwise sit idle. In practice, this is achieved by prompting the model to generate a new context consisting of inferences about the existing context that may be useful for answering test-time queries. The re-represented context from sleep-time can then be provided in the prompt at test-time, enabling the model to respond to user queries at the accuracy of standard test-time compute but with far lower latency. For example, a coding assistant could, during sleep-time, identify architectural patterns, anticipate potential debugging strategies, or infer optimizations before the user’s next input. Moreover, users often ask multiple queries about the same context. In these settings, any inferences made during sleep-time can be shared across queries, effectively amortizing the cost of sleep-time compute and reducing the total average cost per query.
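The two-phase pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, and the prompts are invented for the example.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; returns a canned
    string here so the control flow can be exercised without a model."""
    return f"[model output for prompt of {len(prompt)} chars]"


def sleep_time_compute(context: str) -> str:
    """Offline phase: re-represent the raw context as pre-computed
    inferences, run while the system would otherwise be idle."""
    prompt = (
        "Study the following context and write down inferences, summaries, "
        "and derived facts likely to help answer future questions.\n\n"
        f"Context:\n{context}"
    )
    return call_llm(prompt)  # the re-represented ("learned") context


def answer_query(context: str, learned_context: str, query: str) -> str:
    """Test-time phase: answer using the raw context plus the sleep-time
    inferences, so far less fresh reasoning is needed per query."""
    prompt = (
        f"Context:\n{context}\n\n"
        f"Pre-computed inferences:\n{learned_context}\n\n"
        f"Question: {query}\nAnswer concisely."
    )
    return call_llm(prompt)


# The sleep-time pass runs once per context; its cost is amortized
# across every query that later arrives about that same context.
```

Note that `sleep_time_compute` is called once per context, while `answer_query` may be called many times with the same `learned_context`, which is what produces the amortization described above.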