How should inference-time token budgets vary across models of different capability levels?
The question concerns how much 'thinking time' (inference tokens) a model should get, and whether that budget should depend on how capable the model already is or on something else entirely.
Put concretely: if a weak model and a strong model are solving the same problem, should they get different token budgets? The corpus suggests the more useful axis isn't the model's capability tier at all. It is the difficulty of the prompt and whether the model was trained to use extra tokens productively in the first place.
Start with the surprising part: spending more tokens doesn't help a model that wasn't trained to reason. However large the inference budget, non-reasoning models don't close the gap with reasoning models, because the gap lives in the training regime: a reasoning model has internalized a protocol that makes each additional token do work, while a non-reasoning model just generates more of the same (Can non-reasoning models catch up with more compute?). So budget is not a dial you can turn to buy capability. That said, model size and inference compute are partly interchangeable: a smaller model given more inference compute can match a larger one, specifically on hard prompts, which means pretraining compute and inference compute trade off against each other rather than being separate resources (Can inference compute replace scaling up model size?).
The real lever the corpus keeps returning to is per-prompt difficulty, not per-model tier. Allocating the same total compute adaptively, starving easy prompts and feeding hard ones, beats spending a uniform budget everywhere, and can even beat a larger model running under a flat budget (Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?). This reframes the whole question: instead of asking 'how much should this model get,' ask 'how much should this prompt get.' The most elegant version of that is letting the model decide for itself: Thinkless trains a single model to route between extended thinking and a quick direct answer, calibrating its own budget without anyone labeling which problems are hard (Can models learn when to think versus respond quickly?).
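The adaptive allocation described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited work: the function name `allocate_budget`, the per-prompt difficulty scores in [0, 1], and the `floor` safeguard are all hypothetical, and the sketch assumes some difficulty estimate is already available.

```python
from typing import List


def allocate_budget(difficulties: List[float], total_tokens: int, floor: int = 64) -> List[int]:
    """Split a fixed token budget across prompts in proportion to difficulty.

    difficulties: hypothetical per-prompt scores in [0, 1] (higher = harder).
    floor: minimum tokens every prompt receives (an assumed safeguard).
    """
    n = len(difficulties)
    if n == 0:
        return []
    if total_tokens < floor * n:
        raise ValueError("total_tokens is too small to give every prompt the floor")
    spare = total_tokens - floor * n
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        # No difficulty signal: fall back to a uniform split.
        return [total_tokens // n] * n
    # Integer truncation means the split may leave a few tokens unspent.
    return [floor + int(spare * d / weight_sum) for d in difficulties]


# Two easy prompts and one hard prompt sharing the same total budget.
difficulties = [0.1, 0.1, 0.8]
uniform = [3000 // len(difficulties)] * len(difficulties)
adaptive = allocate_budget(difficulties, total_tokens=3000)
print("uniform: ", uniform)
print("adaptive:", adaptive)
```

Both splits spend at most the same 3,000 tokens; the adaptive one simply moves most of them to the hard prompt. How the difficulty scores are obtained is the hard part in practice and is left open here.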
More tokens can also actively hurt past a point. Accuracy is non-monotonic: pushing thinking tokens from ~1,100 up to ~16K dropped benchmark accuracy from 87.3% to 70.3%, because the model overthinks easy problems and talks itself out of correct answers (Does more thinking time always improve reasoning accuracy?). So a bigger budget isn't a safe default even for a capable model: there's a sweet spot, and overshooting it is a failure mode. This connects to why budgets work at all: only about 20% of tokens are high-entropy 'forking points' where the real reasoning decisions happen, and those carry the learning signal (Do high-entropy tokens drive reasoning model improvements?); reasoning chains also internally rank tokens by function, preserving symbolic computation while shedding filler (Which tokens in reasoning chains actually matter most?). Extra budget only pays off when it lands on those pivotal tokens.
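To make the idea of high-entropy forking points concrete, here is a small sketch that scores each generation step by the Shannon entropy of its next-token distribution and keeps the top 20% of steps. The function names and the toy probability vectors are hypothetical; the 0.2 fraction mirrors the roughly 20% figure quoted above, and real systems would read these distributions from model logits.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(step_probs: List[List[float]], fraction: float = 0.2) -> List[int]:
    """Return the indices of the highest-entropy steps, keeping the top `fraction`."""
    entropies = [token_entropy(p) for p in step_probs]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy distributions over a 4-token vocabulary, one per generation step.
steps = [
    [0.97, 0.01, 0.01, 0.01],     # confident continuation
    [0.25, 0.25, 0.25, 0.25],     # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
print(forking_positions(steps))  # -> [1]
```

Only the uniform step is selected, since it is the one place where the toy model is genuinely undecided between continuations.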
Two lateral moves are worth knowing about. First, 'budget' isn't only reasoning tokens: agentic research shows search iterations follow the same test-time scaling curve, opening a second axis you can trade against thinking depth (Does search budget scale like reasoning tokens for answer quality?). Second, you can scale sideways instead of deeper: sampling parallel latent trajectories buys you coverage of the solution space without the serial latency of one very long chain (Can reasoning systems scale wider instead of only deeper?). The takeaway across all of this: capability level mostly determines whether tokens are useful at all (a training question), while prompt difficulty determines how many to spend (an allocation question), and the frontier is teaching models to make that allocation call themselves.
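The width-versus-depth trade can be illustrated with a simple stand-in. The sketch below is not the latent-trajectory sampling mentioned above; it swaps in plain majority voting over independently sampled answers (self-consistency style) to show how one serial budget might be split into several shorter parallel ones. The function names and the placeholder answers are hypothetical, and no model is actually called.

```python
from collections import Counter
from typing import List, Tuple


def split_budget(total_tokens: int, width: int) -> List[int]:
    """Divide one serial budget into `width` equal parallel budgets."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return [total_tokens // width] * width


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


# One 8,000-token chain versus four 2,000-token chains run in parallel.
budgets = split_budget(8000, width=4)
sampled = ["42", "41", "42", "42"]  # placeholder answers, one per parallel chain
print(budgets, majority_vote(sampled))
```

The agreement share doubles as a rough confidence signal: low agreement suggests the prompt may deserve a deeper or wider budget.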
Sources (10 notes)
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.