How does per-token adaptive compute improve efficiency in recurrent reasoning?

This explores whether spending different amounts of compute on different tokens — instead of a fixed budget for everything — makes recurrent and iterative reasoning systems more efficient, and the corpus reframes that question in a few surprising ways.

This explores whether spending different amounts of compute on different tokens — instead of a flat budget for everything — makes recurrent reasoning more efficient. The corpus's strongest claim is that not all tokens deserve equal compute, because they don't carry equal weight. When models prune their own reasoning chains, they preferentially keep symbolic computation tokens and discard grammar and filler first Which tokens in reasoning chains actually matter most?. The training-time mirror of this is even sharper: only about 20% of tokens are high-entropy 'forking points' where the model actually decides what to do next, and training on just those matches full-gradient updates Do high-entropy tokens drive reasoning model improvements?. So the premise underneath adaptive compute is real — a minority of tokens is where the reasoning lives, and a system that pours equal effort into the rest is wasting it.

The most direct answer to your question is that adaptivity pays off at two levels: per-prompt and per-turn. Compute-optimal scaling shows that giving easy prompts less inference and hard prompts more — same total budget, reallocated — beats a uniform budget and can outperform simply using a bigger model Can we allocate inference compute based on prompt difficulty?. Models can even learn this routing themselves, choosing between extended thinking and a quick answer without being told which problems are hard Can models learn when to think versus respond quickly?. And in multi-turn agentic search, the twist is that *less* per-step compute can be better: capping reasoning per turn preserves the context window for incorporating new evidence across rounds, where unbounded reasoning erodes it Does limiting reasoning per turn improve multi-turn search quality?.

Where the corpus gets genuinely interesting for *recurrent* reasoning specifically is that the depth doesn't have to live in emitted tokens at all. The Hierarchical Reasoning Model couples a slow abstract planner with a fast detailed loop across two timescales, reaching computational depth that fixed-depth transformers can't — solving Sudoku and mazes where chain-of-thought fails, with only 27M parameters Can recurrent hierarchies achieve reasoning that transformers cannot?. That's adaptive compute internalized into latent recurrence rather than spent on a long visible token stream. A complementary move is to scale *width* instead of depth: sampling parallel latent trajectories sidesteps the serial latency of just thinking longer Can reasoning systems scale wider instead of only deeper?.

Two notes complicate the tidy story and are worth knowing. First, the efficiency gain may not come from the reasoning being *meaningful*: models trained on deliberately corrupted, irrelevant traces perform comparably to those trained on correct ones, which suggests reasoning tokens often act as raw computational scaffolding rather than logical steps Do reasoning traces need to be semantically correct?. If tokens are scaffolding, then *how many* you spend matters more than precisely *which* — adaptivity is about allocating raw compute, not curating logic. Second, you can't cheaply buy reasoning at inference time if it wasn't trained in: non-reasoning models don't catch up to reasoning models no matter how large the inference budget, because training is what makes extra tokens productive in the first place Can non-reasoning models catch up with more compute?.

The thread that ties these together — and the thing you might not have known you were asking — is that 'adaptive compute' keeps getting relocated. The long-context bottleneck turns out to be compute, not memory: the cost is consolidating evicted context into fast weights, and more consolidation passes help on harder problems, which is test-time scaling wearing a different hat Is long-context bottleneck really about memory or compute?. Neural memory modules push the same logic into storage, spending capacity adaptively on 'surprising' tokens rather than all of them Can neural memory modules scale language models beyond attention limits?. Per-token adaptive compute, in other words, is one instance of a broader principle the corpus keeps rediscovering: match effort to where the difficulty actually is — across tokens, prompts, turns, memory, and recurrent depth alike.

Sources 11 notes

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

How does per-token adaptive compute improve efficiency in recurrent reasoning?

Sources 11 notes

Next inquiring lines