How do sleep-time and post-completion methods reduce inference latency?

This explores two distinct ways to cut the time a user waits on an LLM: doing heavy work *before* the query arrives ('sleep-time'), and *stopping work early* once an answer is good enough ('post-completion', or early stopping). The corpus treats both as instances of the same deeper idea: latency depends mostly on *where* you spend compute, not on how much you have, so compute should go where it pays off rather than being spent uniformly.

The sleep-time angle is clearest in work reframing the long-context problem. The intuition is that long context is slow because the model has to re-digest enormous inputs at query time. But "Is long-context bottleneck really about memory or compute?" argues the real bottleneck isn't memory capacity at all: it's the *compute* needed to consolidate evicted context into the model's fast weights. The fix is to do that consolidation during 'offline sleep phases,' before the user ever asks, so the expensive transformation is already paid for. Performance keeps improving with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks, which means additional work can be moved out of the live path and into sleep time. A related architectural version of this appears in "Can neural memory modules scale language models beyond attention limits?", where the Titans long-term memory module compresses and stores 'surprising' tokens ahead of time, so the model isn't paying a quadratic attention cost over contexts of 2M+ tokens at inference.
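To make the 'pre-pay before the query' idea concrete, here is a minimal toy sketch. It is not the method from either note: the class `SleepTimeMemory`, the word-count 'state', and the work counters are all hypothetical stand-ins, chosen only to show where the cost lands when consolidation happens offline.

```python
from collections import Counter


class SleepTimeMemory:
    """Toy stand-in for a model that consolidates context while idle.

    Assumption: a word-frequency Counter plays the role of the model's
    fast internal state. Real systems consolidate into weights or a
    learned memory module, which this sketch does not attempt to model.
    """

    def __init__(self) -> None:
        self.state: Counter = Counter()
        self.offline_work = 0  # tokens processed during sleep phases
        self.query_work = 0    # units of work done while the user waits

    def sleep_phase(self, context: str) -> None:
        """Expensive pass over the context, run before any query arrives."""
        tokens = context.lower().split()
        self.state.update(tokens)
        self.offline_work += len(tokens)

    def answer(self, term: str) -> int:
        """Cheap lookup against the consolidated state at query time."""
        self.query_work += 1
        return self.state[term.lower()]


def answer_without_sleep(context: str, term: str) -> tuple:
    """Baseline: re-read the whole context while the user waits.

    Returns (answer, tokens_processed_at_query_time).
    """
    tokens = context.lower().split()
    return tokens.count(term.lower()), len(tokens)


context = "the cat sat on the mat"
memory = SleepTimeMemory()
memory.sleep_phase(context)            # paid for offline
pre_paid_answer = memory.answer("the")  # one unit of live work
baseline_answer, baseline_work = answer_without_sleep(context, "the")
```

Both paths return the same answer; the difference is that the baseline does work proportional to the context length on the live path, while the sleep-time version has already paid that cost.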

The post-completion / early-stopping angle comes at latency from the other end: don't run the full reasoning trace if you don't need to. "Does step-level confidence outperform global averaging for trace filtering?" shows that watching confidence *step by step* lets a system bail out of a bad reasoning trace before it finishes. That catches breakdowns that whole-trace averaging hides, and it achieves accuracy gains comparable to naive majority voting with far fewer generated traces. That's latency saved by not completing work that won't help.
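The difference between step-level and whole-trace confidence can be sketched in a few lines. This is an illustrative sketch only: the sliding-window rule, the window size, the threshold, and the confidence values are assumptions made for the example, not the scoring rule used in the source note.

```python
from collections import deque
from typing import Iterable, List, Tuple


def global_average(confidences: List[float]) -> float:
    """Whole-trace score: only available after the trace has finished."""
    return sum(confidences) / len(confidences)


def run_with_early_stop(
    step_confidences: Iterable[float],
    window: int = 3,          # assumed window size
    threshold: float = 0.5,   # assumed cut-off
) -> Tuple[int, bool]:
    """Consume steps one at a time and stop once the mean confidence of
    the last `window` steps falls below `threshold`.

    Returns (steps_generated, stopped_early).
    """
    recent: deque = deque(maxlen=window)
    steps = 0
    for conf in step_confidences:
        steps += 1
        recent.append(conf)
        if len(recent) == window and sum(recent) / window < threshold:
            return steps, True
    return steps, False


# Hypothetical trace: confident start, a local breakdown, then recovery.
trace = [0.9, 0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.9, 0.9, 0.9]
whole_trace_score = global_average(trace)
steps_used, stopped = run_with_early_stop(trace)
```

On this made-up trace the global average stays above the threshold, so whole-trace filtering would keep it, while the windowed check stops partway through the breakdown and never generates the remaining steps.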

What makes this interesting is that the corpus keeps circling the same principle from different directions. "Can we allocate inference compute based on prompt difficulty?" reallocates a fixed total budget, giving easy prompts less and hard ones more, and substantially outperforms larger models run under uniform budgets. "Can models learn when to think versus respond quickly?" trains a single model to *decide* when to reason at length versus answer directly. "Can reasoning systems scale wider instead of only deeper?" sidesteps the serial cost of deep reasoning by sampling parallel paths instead of one long chain. And "Does limiting reasoning per turn improve multi-turn search quality?" caps reasoning per turn so an agent doesn't use up the context it needs for later retrieval rounds.
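As a rough illustration of difficulty-aware allocation, the sketch below splits a fixed sample budget across prompts in proportion to a difficulty score. The proportional rule, the one-sample minimum, and the scores themselves are assumptions for the example; the source note does not specify an allocation rule, and `allocate_budget` is a hypothetical helper.

```python
from typing import List, Sequence


def allocate_budget(difficulties: Sequence[float], total_budget: int) -> List[int]:
    """Split `total_budget` samples across prompts.

    Every prompt gets at least one sample; the rest is shared in
    proportion to difficulty, using largest-remainder rounding so the
    allocation sums exactly to `total_budget`.
    """
    n = len(difficulties)
    if total_budget < n:
        raise ValueError("budget must cover at least one sample per prompt")

    remaining = total_budget - n
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * d / total_difficulty for d in difficulties]

    allocation = [1 + int(share) for share in shares]
    leftover = total_budget - sum(allocation)

    # Hand leftover samples to the prompts with the largest fractional share.
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation


# Hypothetical difficulty scores: two easy prompts and one hard one.
adaptive = allocate_budget([1, 1, 8], total_budget=12)
uniform = [12 // 3] * 3
```

Both schemes spend the same twelve samples; the adaptive one concentrates them on the hard prompt instead of giving every prompt four.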

The takeaway: 'reduce latency' isn't one technique, it's a spectrum of *when you choose to spend.* Sleep-time methods push compute earlier (pre-pay before the query); early stopping makes later compute conditional (don't pay if you don't need to); adaptive routing decides per prompt. All three improve on the naive approach of running a fixed, full pipeline every single time, which is the real source of wasted waiting.

Sources (7 notes)

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
