INQUIRING LINE

Can sleep-time compute reduce latency demands during model inference?

This explores whether doing computation 'offline' — ahead of a user's request, during idle or 'sleep' phases — can lower the work a model has to do in real time when someone is waiting for an answer.


The corpus has no paper named 'sleep-time compute,' but it has the underlying idea, moving work off the critical path, in a sharper form, plus several pieces that reframe what latency even is. The clearest match is the long-context work showing that the real bottleneck isn't memory capacity but the *compute* needed to consolidate evicted context into the model's fast weights. That consolidation happens during offline sleep phases, and performance improves with the number of consolidation passes (Is long-context bottleneck really about memory or compute?). That's the heart of the sleep-time bet: pay the consolidation cost when no one is waiting, so the live request reads from an already-digested state instead of recomputing from scratch.
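
The sleep-time bet can be illustrated with a toy analogy. The sketch below is not the fast-weight consolidation mechanism from the cited note; it only shows the pattern of paying a digestion cost offline so the live path becomes a cheap read. All names (`SleepTimeStore`, `consolidate`, `lookup`, `scan_lookup`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class SleepTimeStore:
    """Toy store: digest context offline, answer from the digest online."""

    def __init__(self):
        self.pending = []              # raw context not yet consolidated
        self.docs = []                 # consolidated documents, by id
        self.index = defaultdict(set)  # digested state: word -> doc ids

    def add_context(self, text):
        # Cheap on the live path: only queue the raw text.
        self.pending.append(text)

    def consolidate(self):
        """Run during idle time; returns the number of words processed."""
        work = 0
        for text in self.pending:
            doc_id = len(self.docs)
            self.docs.append(text)
            for word in text.lower().split():
                self.index[word].add(doc_id)
                work += 1
        self.pending.clear()
        return work

    def lookup(self, word):
        """Live path: a single dictionary read against the digested state."""
        return sorted(self.index.get(word.lower(), set()))


def scan_lookup(docs, word):
    """Baseline without consolidation: rescan every document per request."""
    word = word.lower()
    return [i for i, text in enumerate(docs) if word in text.lower().split()]
```

Both lookups return the same document ids once `consolidate()` has run. The difference is when the per-word work is paid: `scan_lookup` pays it on every request, while `SleepTimeStore` pays it once, off the critical path. Context added after the last consolidation pass is invisible to `lookup`, which is the freshness cost of this pattern.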

What makes this more than a trick is that inference compute genuinely trades against other resources. Smaller models given more inference compute can match larger ones on hard prompts, so pretraining and inference budgets aren't independent dials (Can inference compute replace scaling up model size?). If you can shift some of that compute earlier, into a sleep phase, you're effectively spending a different budget, one that costs the waiting user nothing, to buy the same accuracy. Persistent neural-memory modules push in the same direction: they store and compress 'surprising' tokens into a long-term store ahead of time, so the live attention pass stays short instead of paying a quadratic price over a giant context (Can neural memory modules scale language models beyond attention limits?).
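
The "quadratic price" can be made concrete with a back-of-envelope count. This is a rough sketch under loud assumptions: it counts query-key score computations only, ignores causal masking and constant factors, and models the memory module as a fixed number of extra slots each token attends to. The function names and the `window` and `memory_slots` parameters are hypothetical, and this is not the actual cost model of any cited architecture.

```python
def full_attention_pairs(n_tokens):
    """Score computations when every token attends to every token."""
    return n_tokens * n_tokens


def windowed_pairs(n_tokens, window, memory_slots):
    """Score computations with a short attention window plus a fixed
    number of compressed memory slots (an assumed simplification)."""
    per_token = min(window, n_tokens) + memory_slots
    return n_tokens * per_token
```

Under these assumptions, a 1,000-token context costs 1,000,000 score computations with full attention, versus 110,000 with a 100-token window and 10 memory slots. The first grows quadratically with context length; the second grows linearly once the context exceeds the window.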

The corpus also says something important about *where the latency actually hurts*, which reframes the question. Depth-only reasoning is serial: each step waits on the last, so the painful latency is in long sequential chains. One answer is to scale width instead: sample parallel latent trajectories that explore the solution space at once, sidestepping the serial cost rather than precomputing around it (Can reasoning systems scale wider instead of only deeper?). Another is simply not to spend the compute when it isn't needed: adaptive allocation gives easy prompts a small budget and saves the heavy thinking for hard ones (Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?), and models can even learn on their own to route between a fast direct answer and slow extended thinking (Can models learn when to think versus respond quickly?). Sleep-time compute and these techniques are complementary levers: precompute reduces the *baseline* work, adaptive routing reduces the *wasted* work.
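
Adaptive allocation can be sketched as splitting a fixed budget in proportion to estimated difficulty. This is a minimal illustration, not the method from the cited notes: the `allocate_budget` function, its `floor` parameter, and the assumption that a non-negative difficulty score is already available for each prompt are all hypothetical.

```python
def allocate_budget(difficulties, total, floor=1):
    """Split `total` compute units across prompts by difficulty.

    Every prompt gets at least `floor` units; the remainder is shared in
    proportion to each prompt's (assumed, non-negative) difficulty score.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if total < floor * n:
        raise ValueError("total budget is below the per-prompt floor")

    spare = total - floor * n
    weight = sum(difficulties)
    if weight == 0:
        shares = [spare // n] * n  # no signal: split evenly
    else:
        shares = [int(spare * d / weight) for d in difficulties]
    budgets = [floor + s for s in shares]

    # Rounding down leaves a few units over; give them to the hardest first.
    leftover = total - sum(budgets)
    order = sorted(range(n), key=lambda i: difficulties[i], reverse=True)
    for k in range(leftover):
        budgets[order[k % n]] += 1
    return budgets
```

The total spent is the same as under a uniform policy; only the distribution changes, which is the point made in the notes on adaptive allocation. In practice the hard part is the difficulty estimate itself, which this sketch takes as given.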

There's a real limit worth knowing, though. Inference compute isn't a free substitute for everything. A non-reasoning model can't be rescued by pouring tokens at it: the training regime, not the inference budget, is what makes extra compute productive (Can non-reasoning models catch up with more compute?). The same caution applies to sleep-time tricks: offline consolidation lowers latency only when the model already knows how to *use* the consolidated state. The quietly underrated lever here is architecture: tuning hidden size, MLP-to-attention ratio, and grouped-query attention delivered up to 42% greater throughput *and* up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget (Can architecture choices improve inference efficiency without sacrificing accuracy?). So the honest synthesis is: yes, moving compute into idle phases can reduce live latency, but the biggest wins come from combining it with adaptive spending, width-based parallelism, and an architecture built for cheap inference in the first place.


Sources 9 notes

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
