Can post-thinking compute on memory reduce query-time reasoning costs?

This explores whether doing the heavy lifting up front — consolidating, compressing, or pruning what's in memory *before* a query arrives — can lower the compute you pay when the model actually reasons in response to that query.

The corpus's sharpest support comes from work that reframes the long-context problem itself: the bottleneck is not how much you can store but the compute needed to *transform* evicted context into fast internal state, and that transformation can happen offline in 'sleep' phases (see "Is long-context bottleneck really about memory or compute?"). Crucially, performance improves with more consolidation passes, so you can spend test-time-scaling-style effort ahead of time and bank the result rather than paying for it at every query.
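The "bank the result" argument is an amortization claim, and a toy cost model makes the trade explicit. The sketch below is illustrative only: the function names and every number are hypothetical, and it assumes costs are measured in one common unit (for example, tokens processed) and that consolidation lowers the per-query cost by a fixed amount.

```python
def break_even_queries(offline_cost: float,
                       raw_query_cost: float,
                       consolidated_query_cost: float) -> float:
    """Queries needed before a one-time offline pass pays for itself.

    Assumption: every query costs the same, before and after consolidation.
    """
    saving = raw_query_cost - consolidated_query_cost
    if saving <= 0:
        # Consolidation that does not make queries cheaper never pays off.
        return float("inf")
    return offline_cost / saving


def total_cost(n_queries: int, offline_cost: float, per_query_cost: float) -> float:
    """One-time offline cost plus the accumulated query-time cost."""
    return offline_cost + n_queries * per_query_cost


# Hypothetical numbers: 1000 units of offline work, queries drop from 50 to 10 units.
n = break_even_queries(offline_cost=1000, raw_query_cost=50, consolidated_query_cost=10)
print(n, total_cost(25, 1000, 10), total_cost(25, 0, 50))
```

Under these made-up numbers the offline pass breaks even at 25 queries; memory that is queried rarely may never recoup it, which is why the trade depends on how often the same state is reused.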

The complementary move is making memory smaller and cleaner before the question lands. Autonomous memory folding compresses an agent's interaction history into structured episodic, working, and tool schemas, cutting token overhead while preserving what matters, and the structure is what avoids the degradation that naive compression causes (see "Can agents compress their own memory without losing critical details?"). From a different angle, recursive subtask trees with rule-based KV-cache pruning sustain accurate reasoning even when 90% of the cache is manipulated, letting a single model carry working memory that would otherwise demand a multi-agent setup (see "Can recursive subtask trees overcome context window limits?"). Both say the same thing in different vocabularies: curate the state instead of dragging the whole history forward.

There's a more radical version worth knowing about: what if you carry almost no history at all? Atom of Thoughts contracts problems into a chain where each state depends only on the current problem, not the accumulated past, eliminating the 'historical baggage' that bloats reasoning while keeping answers equivalent (see "Can reasoning systems forget history without losing coherence?"). This matters because longer inputs aren't free: reasoning accuracy drops from 92% to 68% with just 3,000 tokens of padding, far below the context limit, and chain-of-thought doesn't rescue it (see "Does reasoning ability actually degrade with longer inputs?"). So pre-processing memory isn't only a cost play: a leaner, pre-digested state can actually reason *better*, not just more cheaply.

The deeper lesson tying these together is that *where* you invest compute matters more than *how much* you spend at query time. Non-reasoning models never catch up to reasoning models no matter how large their inference budget, because the training regime, not the live token spend, is what makes additional thinking productive (see "Can non-reasoning models catch up with more compute?"). Post-thinking compute on memory is the same principle pushed into the deployment layer: front-load the work that makes later reasoning efficient. The honest caveat is that the corpus shows this mostly for *consolidation and pruning*; it does not contain a clean general result that offline memory work provably substitutes for query-time reasoning. The evidence points strongly in that direction, but it is assembled from adjacent findings rather than from one paper that proves the trade directly.

Sources (6 notes)

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
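To make the three-schema idea concrete, here is a minimal sketch of folding a raw history into episodic, working, and tool memory. It is not DeepAgent's implementation: the `FoldedMemory` class, the `fold` function, the `kind`/`raw`/`summary` fields, and the whitespace token count are all assumptions for illustration, and a fixed routing rule stands in for whatever the agent itself would decide to keep.

```python
from dataclasses import dataclass, field


@dataclass
class FoldedMemory:
    episodic: list[str] = field(default_factory=list)  # key events and outcomes
    working: list[str] = field(default_factory=list)   # current goal and open subgoals
    tool: list[str] = field(default_factory=list)      # tool usage and how it went


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def folded_tokens(mem: FoldedMemory) -> int:
    return sum(approx_tokens(s) for s in mem.episodic + mem.working + mem.tool)


def fold(history: list[dict], max_items: int = 3) -> FoldedMemory:
    """Route each entry's summary into a schema, keeping the newest max_items per schema.

    Each history entry is assumed to look like
    {"kind": "event" | "goal" | "tool", "raw": str, "summary": str}.
    """
    mem = FoldedMemory()
    buckets = {"event": mem.episodic, "goal": mem.working, "tool": mem.tool}
    for entry in history:
        bucket = buckets.get(entry["kind"])
        if bucket is not None:  # entries of unknown kind are dropped
            bucket.append(entry["summary"])
    for bucket in buckets.values():
        if len(bucket) > max_items:
            del bucket[:len(bucket) - max_items]  # discard the oldest entries
    return mem
```

The point of the structure is that later reasoning reads three short, typed lists rather than the raw transcript; how good the summaries are is exactly the part this sketch leaves out.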

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
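The pruning rule can be illustrated on plain strings: once a subtask is finished, its internal steps are dropped and only its conclusion stays in working memory. This is a sketch under stated assumptions, not the Thread Inference Model's mechanism, which operates on the KV cache; the `Task` class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    thought: str
    subtasks: list["Task"] = field(default_factory=list)
    conclusion: Optional[str] = None  # set once the task is finished


def working_memory(task: Task) -> list[str]:
    """Entries still kept in context under the pruning rule."""
    if task.conclusion is not None:
        # Finished task: its whole subtree is pruned, only the conclusion survives.
        return [task.conclusion]
    kept = [task.thought]
    for sub in task.subtasks:
        kept.extend(working_memory(sub))
    return kept


def full_trace(task: Task) -> list[str]:
    """Everything that would sit in context without pruning."""
    out = [task.thought]
    for sub in task.subtasks:
        out.extend(full_trace(sub))
    if task.conclusion is not None:
        out.append(task.conclusion)
    return out


root = Task("plan", [
    Task("A", [Task("a1", conclusion="a1 done"), Task("a2", conclusion="a2 done")],
         conclusion="A=1"),
    Task("B"),  # still open, so its thought stays in memory
])
print(len(working_memory(root)), "of", len(full_trace(root)), "entries kept")
```

Because pruning follows the tree, the retained state grows with the depth of the open path rather than with the total number of steps taken.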

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
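A toy version of contraction shows what "depends only on the current problem" means in practice. The sketch below uses arithmetic nodes in place of model-generated subquestions, so it is an analogy rather than the Atom of Thoughts procedure; the `Dag` alias, `contract`, and `solve` are hypothetical names. Each step solves the nodes with no dependencies, folds their values into the nodes that needed them, and returns a smaller, self-contained DAG.

```python
from functools import partial
from typing import Callable

# node name -> (names of unresolved dependencies, function taking them as keyword args)
Dag = dict[str, tuple[tuple[str, ...], Callable[..., int]]]


def contract(dag: Dag) -> Dag:
    """One contraction step: returns a new DAG that no longer refers to solved nodes."""
    solved = {name: fn() for name, (deps, fn) in dag.items() if not deps}
    if not solved:
        raise ValueError("no independent nodes; the graph has a cycle")
    contracted: Dag = {}
    for name, (deps, fn) in dag.items():
        if name in solved:
            continue  # solved nodes are dropped from the next state
        known = {d: solved[d] for d in deps if d in solved}
        remaining = tuple(d for d in deps if d not in solved)
        contracted[name] = (remaining, partial(fn, **known))
    return contracted


def solve(dag: Dag, target: str) -> int:
    """Contract until the target is independent, carrying only the current DAG."""
    while dag[target][0]:
        dag = contract(dag)  # the previous state is discarded here
    return dag[target][1]()


problem: Dag = {
    "a": ((), lambda: 2),
    "b": ((), lambda: 3),
    "c": (("a", "b"), lambda a, b: a * b),
    "d": (("c", "a"), lambda c, a: c + a),
}
print(solve(problem, "d"))
```

The loop in `solve` keeps a single variable, the current contracted problem, which is the sense in which the approach is memoryless: nothing from earlier steps is consulted except what was folded into the present state.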

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
