Can memory and test-time compute scale together as a single axis?

This explores whether 'how much a model remembers' and 'how much it thinks at inference' are really one combined resource you can turn up together — or two separate dials the corpus treats differently.

The corpus suggests memory and test-time compute are converging but not identical: they are better understood as two coupled dials that increasingly share the same control loop. The starting point is that test-time compute is already a genuine scaling resource. Spending more thinking at inference can substitute for raw model size on hard prompts [Can inference compute replace scaling up model size?], and the smartest spending is adaptive: pour compute into hard problems, not easy ones [How should we allocate compute budget at inference time?]. Memory, meanwhile, has been quietly promoted to a scaling frontier of its own: by late 2025, returns from restructuring how a model stores and retrieves information began exceeding returns from adding parameters [Has memory architecture replaced parameter count as the scaling frontier?].
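As a rough illustration of the adaptive-spending idea, the sketch below splits a fixed sampling budget across prompts in proportion to an estimated difficulty score. The function name `allocate_budget`, the difficulty scores, and the proportional rule are all assumptions made for this example; the notes summarized here do not specify a particular allocation rule.

```python
def allocate_budget(difficulties, total_budget, min_per_prompt=1):
    """Split total_budget samples across prompts by estimated difficulty.

    Hypothetical helper: difficulties are non-negative scores from some
    upstream estimator (assumed, not described in the source notes).
    """
    n = len(difficulties)
    if total_budget < n * min_per_prompt:
        raise ValueError("budget too small to give every prompt the minimum")

    # Every prompt gets the minimum; the rest is shared by difficulty.
    spare = total_budget - n * min_per_prompt
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [spare // n] * n
    else:
        shares = [int(spare * d / total_difficulty) for d in difficulties]

    # Rounding leaves a few samples unassigned; give them to the hardest prompts.
    leftover = spare - sum(shares)
    hardest_first = sorted(range(n), key=lambda i: difficulties[i], reverse=True)
    for i in hardest_first[:leftover]:
        shares[i] += 1

    return [min_per_prompt + s for s in shares]


# Two easy prompts and one hard prompt sharing 13 samples.
print(allocate_budget([1, 1, 8], 13))
```

A uniform split would give each prompt roughly the same number of samples; the proportional split moves most of the budget to the hard prompt while still guaranteeing every prompt at least one sample.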

The interesting part is where these two meet. That same memory-architecture work explicitly names 'memory-aware test-time scaling loops' as one of its converging signals, meaning the field is already building systems where the inference-time thinking budget and the memory structure are designed together, not separately. The clearest demonstration that compute axes can merge comes from deep research agents, where search budget turns out to follow the *same* scaling curve as reasoning tokens [How does search scale like reasoning in agent systems?]. If retrieval scales like reasoning, then 'looking things up' (a memory operation) and 'thinking longer' (a compute operation) start to look like the same knob measured in different units.

But the corpus also pushes back on collapsing everything into one axis. Test-time scaling itself splits cleanly into *internal* (training the model to reason autonomously) and *external* (search and verification at inference), and these complement rather than substitute for each other [How do internal and external test-time scaling compare?] [How should test-time scaling methods be categorized and designed?]. A similar tension shows up in how compute gets spent: parallel scaling buys coverage, sequential scaling buys depth, and the right mix depends on task structure [How should we balance parallel versus sequential compute at test time?]. Once total compute is held fixed, the choice of framework matters less than the size of the budget and the quality of your value function [Does the choice of reasoning framework actually matter for test-time performance?]. So 'scale them together' only works if you've already decided *which kind* of compute and *which kind* of memory.
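The coverage-versus-depth trade-off can be sketched with a toy probability model. It assumes that parallel samples are independent and that a multi-step chain only succeeds if every step succeeds; both are simplifying assumptions for illustration, and the function names are hypothetical rather than taken from the cited notes.

```python
def coverage(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def chain_success(q: float, n: int) -> float:
    """Chance that an n-step chain is correct when every step must be correct."""
    return q ** n


def parallel_chain_coverage(q: float, n: int, k: int) -> float:
    """Coverage from sampling k independent attempts at an n-step chain."""
    return coverage(chain_success(q, n), k)


# Same per-step reliability and same number of parallel samples,
# but a short problem versus a long compositional one.
short = parallel_chain_coverage(q=0.9, n=2, k=8)
long_ = parallel_chain_coverage(q=0.9, n=20, k=8)
print(f"short chain: {short:.3f}, long chain: {long_:.3f}")
```

Under these assumptions, extra parallel samples help a lot on short independent problems, while on long compositional chains the per-attempt success rate decays with length, which is one way to read the claim that the right mix depends on task structure.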

The sharpest surprise is that more memory isn't always the goal; sometimes the scaling move is to *forget*. Atom of Thoughts deliberately strips accumulated history, making each reasoning state depend only on the current subproblem, because carried-over history bloats reasoning without improving the answer [Can reasoning systems forget history without losing coherence?]. That's a direct counterexample to the 'one rising axis' intuition: here, scaling compute works *better* when memory is reduced. Width-based scaling tells a related story: you can sample parallel latent trajectories instead of paying the serial latency of depth, which is a compute move that sidesteps memory accumulation entirely [Can reasoning systems scale wider instead of only deeper?].
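To make the 'forgetting' move concrete, here is a deliberately tiny sketch that contrasts a solver whose context is only the current, contracted problem with one that carries every earlier state along. It is not the Atom of Thoughts algorithm; the list-summing task, the `contract` step, and the context-size measure are stand-ins chosen for this example.

```python
def contract(problem):
    """Solve one small piece and fold it back into a smaller problem."""
    first, second, *rest = problem
    return [first + second, *rest]


def solve_memoryless(problem):
    """Each step sees only the current state; earlier states are discarded."""
    state = list(problem)
    peak_context = len(state)
    while len(state) > 1:
        state = contract(state)
        peak_context = max(peak_context, len(state))
    return state[0], peak_context


def solve_with_history(problem):
    """Each step carries all earlier states, so context grows with depth."""
    state = list(problem)
    history = [list(state)]
    while len(state) > 1:
        state = contract(state)
        history.append(list(state))
    context_size = sum(len(s) for s in history)
    return state[0], context_size


print(solve_memoryless([1, 2, 3, 4]))
print(solve_with_history([1, 2, 3, 4]))
```

Both solvers reach the same answer, but the memoryless one never holds more than the original problem in context, while the history-carrying one accumulates every intermediate state.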

What the reader probably didn't expect: the deepest coupling isn't at inference at all. Pretraining and inference compute aren't independent resources [Can inference compute replace scaling up model size?], and you can bake test-time-style thinking into the training data itself, getting 3x data efficiency by attaching reasoning traces to harder tokens [Can training data augmentation match test-time compute scaling benefits?]. And no amount of inference compute lets a non-reasoning model catch a reasoning one; the training regime sets the ceiling [Can non-reasoning models catch up with more compute?]. So the honest answer is: memory and test-time compute are increasingly *co-designed*, and they share scaling laws in agentic settings, but treating them as one fungible axis hides the choices (internal vs. external, parallel vs. sequential, remember vs. forget) that actually determine whether turning the dial helps.

Sources (12 notes)

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Has memory architecture replaced parameter count as the scaling frontier?

Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

How should test-time scaling methods be categorized and designed?

Research identifies internal vs external as the primary taxonomic split for test-time scaling, with training-side constraints (policy entropy collapse) and novel directions that shift *when* compute happens (sleep-time, post-completion) rather than just *how much*. Methods like consensus games and recursive LMs sidestep traditional scaling tradeoffs.

How should we balance parallel versus sequential compute at test time?

Parallel methods improve coverage; sequential methods enable depth. The optimal choice depends on task structure: parallel wins for independent short problems, sequential for compositional chains requiring intermediate accumulation.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
