INQUIRING LINE

Can offline context optimization reduce test-time latency like sleep-time compute?

This explores whether doing the heavy 'thinking' work ahead of time — consolidating context offline before a query arrives — can cut the wait a user feels at query time, the way 'sleep-time compute' precomputes state during idle periods.


The corpus says yes: expensive context processing can move out of the moment a user is waiting and into an offline phase, the way sleep-time compute does. The most direct evidence reframes what the latency cost actually is. The long-context bottleneck isn't memory capacity but the *compute* needed to fold evicted context into the model's fast weights, and this consolidation can happen during an offline 'sleep' phase rather than at query time (see 'Is long-context bottleneck really about memory or compute?'). Crucially, more consolidation passes keep improving performance on harder reasoning tasks, so the offline phase behaves like test-time scaling, just relocated to a moment when nobody is waiting. That confirms the premise of your question: the work still happens, but it stops sitting on the critical path.
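The relocation of cost can be sketched in a few lines. This is a toy illustration only: the `SleepTimeStore` class, its word index, and the cost counters are hypothetical stand-ins, and building an index is much simpler than consolidating context into fast weights. The sketch shows only that the expensive scan happens in `sleep()`, while `query()` stays cheap.

```python
from dataclasses import dataclass, field


@dataclass
class SleepTimeStore:
    """Toy store (hypothetical) that pays consolidation cost offline."""
    raw_context: list[str] = field(default_factory=list)
    index: dict[str, list[int]] = field(default_factory=dict)
    offline_cost: int = 0  # work units spent while nobody is waiting
    query_cost: int = 0    # work units spent on the critical path

    def observe(self, chunk: str) -> None:
        # New context arrives; nothing expensive happens yet.
        self.raw_context.append(chunk)

    def sleep(self) -> None:
        # Offline phase: scan every chunk and build a word -> chunk-id index.
        self.index = {}
        for i, chunk in enumerate(self.raw_context):
            for word in set(chunk.lower().split()):
                self.index.setdefault(word, []).append(i)
            self.offline_cost += 1

    def query(self, word: str) -> list[str]:
        # Query time: one dictionary lookup instead of a scan over all context.
        self.query_cost += 1
        return [self.raw_context[i] for i in self.index.get(word.lower(), [])]


store = SleepTimeStore()
for chunk in ["alpha beta", "beta gamma", "delta"]:
    store.observe(chunk)
store.sleep()               # all scanning is charged here
hits = store.query("beta")  # the user-facing call does a single lookup
```

The total work is not reduced; it is charged to `offline_cost` instead of `query_cost`, which mirrors the claim that the work leaves the critical path rather than disappearing.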

The deeper payoff is seeing this as one instance of a general move the corpus keeps making: pay the compute when latency doesn't matter. Thinking-augmented pre-training does the same trick at training time, baking generated reasoning traces into the data so that 3B-scale models see roughly 3x better data efficiency and gains of more than 10% on reasoning benchmarks (see 'Can training data augmentation match test-time compute scaling benefits?'). And the field's own taxonomy formalizes this split: 'internal' scaling trains capability in ahead of time, while 'external' scaling extracts performance at inference, and the two complement rather than compete (see 'How do internal and external test-time scaling compare?'). Offline context optimization essentially shifts work from the external (latency-bearing) column into the internal (precomputed) one.

But there's a sharp caveat the corpus surfaces: relocating compute only helps if the precomputed state is actually *productive*. Reasoning models outperform non-reasoning ones at any inference budget because training instilled a protocol that makes extra tokens pay off; raw compute without the right structure doesn't close the gap (see 'Can non-reasoning models catch up with more compute?'). So offline consolidation isn't a free latency win; it's a bet that the offline work encodes something the model can cheaply exploit later.

There's also a second, complementary route to the same latency goal that doesn't require an offline phase at all: decouple the slow part and run it *alongside* generation. Asynchronous verifiers can check a reasoning trace in parallel, adding near-zero latency on correct runs because verification never blocks the main stream (see 'Can verifiers monitor reasoning without slowing generation down?'). That's worth knowing because it reframes 'offline vs. test-time' as a spectrum of *when you pay*: before (consolidation), during but in parallel (async verification), or adaptively per prompt, spending budget on hard inputs and sparing easy ones (see 'How should we allocate compute budget at inference time?').
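The 'during but in parallel' option can be sketched with Python's `asyncio`. This is a minimal illustration under stated assumptions, not the system described in the cited note: `generate`, `verify`, and the `"ERROR"` string check are hypothetical, and this verifier only records violations, whereas the note describes one that intervenes on them.

```python
import asyncio


async def generate(steps: list[str], queue: asyncio.Queue) -> list[str]:
    """Emit reasoning steps without ever waiting on the verifier."""
    emitted = []
    for step in steps:
        await asyncio.sleep(0)   # yield control; stands in for decode time
        emitted.append(step)
        queue.put_nowait(step)   # non-blocking hand-off to the verifier
    queue.put_nowait(None)       # sentinel: generation is finished
    return emitted


async def verify(queue: asyncio.Queue) -> list[str]:
    """Check steps as they arrive and collect any violations."""
    violations = []
    while True:
        step = await queue.get()
        if step is None:
            return violations
        if "ERROR" in step:      # hypothetical check, for illustration only
            violations.append(step)


async def run(steps: list[str]) -> tuple[list[str], list[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    emitted, violations = await asyncio.gather(
        generate(steps, queue), verify(queue)
    )
    return emitted, violations


emitted, violations = asyncio.run(run(["a = 1", "b = a + 1", "ERROR: b = 5"]))
```

Because the generator only calls `put_nowait`, it never waits on the verifier; on a trace with no violations the verifier adds no blocking step to the main stream.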

The thing you might not have known you wanted: even the hardware layer plays the same game. On memory-bound mobile chips, MobileLLM finds it's cheaper to *recompute* a shared transformer block twice than to move a second block's separate weights across the memory bus (see 'Does recomputing weights cost less than moving them on mobile?'). Across software, training, and silicon, the corpus keeps rediscovering one principle: latency is about *where and when* you spend compute relative to the bottleneck, not how much you spend. Sleep-time consolidation is just the most literal version of moving the spend to where it's free.
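A back-of-the-envelope model makes the hardware trade-off concrete. Everything here is an assumption for illustration: the additive latency model, the function `two_block_latency`, and all numeric values are hypothetical, not measurements from MobileLLM. The model assumes that a shared block's weights are fetched once and reused, while two distinct blocks each need their own fetch.

```python
def two_block_latency(weight_bytes: float, flops: float,
                      bandwidth: float, throughput: float,
                      shared: bool) -> float:
    """Seconds to run two adjacent blocks under a simple additive model."""
    fetch = weight_bytes / bandwidth   # time to move one block's weights
    compute = flops / throughput       # time to execute one block
    fetches = 1 if shared else 2       # shared weights are moved only once
    return fetches * fetch + 2 * compute


# Hypothetical memory-bound device: slow memory bus, fast arithmetic.
params = dict(weight_bytes=10e6, flops=20e6, bandwidth=10e9, throughput=1e12)
separate = two_block_latency(**params, shared=False)
shared = two_block_latency(**params, shared=True)
saving = separate - shared             # equals one weight fetch in this model
```

In this model both variants execute two blocks, so the saving is exactly one weight fetch; the more memory-bound the device, the larger that fetch is relative to compute.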


Sources (7 notes)

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can verifiers monitor reasoning without slowing generation down?

Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice incurs less latency than fetching separate weights, gaining accuracy with no parameter increase.

Next inquiring lines