How should inference compute budget be allocated across different prompt difficulties?

This explores how to spend a fixed inference-time compute budget wisely — giving harder prompts more thinking and easier ones less, rather than spending the same on every query.

The corpus is unusually unified on the headline: uniform spending is wasteful. The cleanest statement is that adaptive allocation, where easy prompts get less compute and hard ones get more, beats fixed per-prompt budgets, and that reallocating the same total compute adaptively can even outperform a larger model run under a uniform budget (see: Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?). The intuition: easy problems are already solved on the first pass, so extra tokens there are pure waste, while hard problems are starved under an averaged budget.
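As a rough illustration of adaptive allocation, the sketch below splits one fixed token budget across prompts in proportion to a difficulty estimate. Everything named here is an assumption made for the example: the function `allocate_budget`, the `min_tokens` floor, and the difficulty scores themselves, since the notes do not prescribe how difficulty is estimated or what allocation rule to use.

```python
from typing import List


def allocate_budget(
    difficulties: List[float], total_tokens: int, min_tokens: int = 0
) -> List[int]:
    """Split total_tokens across prompts in proportion to estimated difficulty.

    `difficulties` are hypothetical scores (higher means harder); how they are
    produced is outside the scope of this sketch.
    """
    n = len(difficulties)
    if n == 0:
        return []
    reserved = min_tokens * n
    if reserved > total_tokens:
        raise ValueError("total_tokens is too small for the per-prompt floor")
    spare = total_tokens - reserved

    # Negative scores are treated as zero difficulty.
    weights = [max(d, 0.0) for d in difficulties]
    weight_sum = sum(weights)
    if weight_sum == 0:
        # No difficulty signal at all: fall back to a uniform split.
        shares = [spare // n] * n
    else:
        # Rounding down guarantees the total budget is never exceeded.
        shares = [int(spare * w / weight_sum) for w in weights]
    return [min_tokens + s for s in shares]


# Two easy prompts and one hard prompt sharing a 1000-token budget.
alloc = allocate_budget([0.1, 0.1, 0.8], total_tokens=1000, min_tokens=50)
```

The floor keeps easy prompts from being cut to zero, and the proportional rule is only one possible choice; a learned or threshold-based rule would fit the same interface.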

The more surprising finding is that inference compute and model size are partly interchangeable. On hard prompts, a smaller model given more inference compute can match a much larger one, which means pretraining and inference are not separate resource pools but tradeable against each other (see: Can inference compute replace scaling up model size?). But there's a ceiling: more compute only pays off if the model was trained to use it. A non-reasoning model doesn't catch up to a reasoning model no matter how large its inference budget, because the reasoning model learned a protocol that makes additional tokens productive (see: Can non-reasoning models catch up with more compute?). So 'spend more on hard prompts' assumes the model knows how to convert that spend into better answers.

The deeper question the corpus pushes toward is: who decides a prompt is hard? One answer is to let the model route itself. Thinkless trains a single model to choose between extended reasoning and a quick direct answer, learning this calibration without explicit difficulty labels and avoiding the failure where it collapses to always-think or never-think (see: Can models learn when to think versus respond quickly?). That reframes budget allocation from an external scheduling problem into a learned skill inside the model.

A subtlety worth knowing: allocation isn't just one knob. The same diminishing-returns scaling curve shows up on multiple axes. Search iterations in agentic research behave like reasoning tokens, so a model can trade reasoning budget against search budget to hit a quality target (see: Does search budget scale like reasoning tokens for answer quality?). And even long-context handling turns out to be a test-time-scaling problem in disguise, where more consolidation passes help most on harder reasoning tasks (see: Is long-context bottleneck really about memory or compute?). Allocation is really about distributing a budget across several compute dimensions, not just choosing a token count.
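To make the multi-axis idea concrete, here is a minimal sketch that splits one budget between reasoning and search by repeatedly giving the next unit to whichever axis currently offers the larger marginal gain. The logarithmic quality curves and their coefficients are invented placeholders standing in for "diminishing returns"; they are not measurements from the cited notes, and `split_budget` is a hypothetical helper name.

```python
import math
from typing import Callable, Dict


def split_budget(
    gains: Dict[str, Callable[[int], float]], total_units: int
) -> Dict[str, int]:
    """Greedily assign each budget unit to the axis with the largest marginal gain.

    For concave (diminishing-returns) curves, this greedy rule maximizes the
    summed quality for the given total.
    """
    spent = {axis: 0 for axis in gains}
    for _ in range(total_units):
        best = max(
            gains,
            key=lambda axis: gains[axis](spent[axis] + 1) - gains[axis](spent[axis]),
        )
        spent[best] += 1
    return spent


# Hypothetical diminishing-returns curves: quality as a function of units spent.
curves = {
    "reasoning": lambda units: 1.0 * math.log1p(units),
    "search": lambda units: 0.5 * math.log1p(units),
}
plan = split_budget(curves, total_units=12)
```

With these placeholder curves, the plan favors reasoning but still gives search a share, because the first few search iterations are worth more than the later reasoning units.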

Finally, allocation shouldn't be optimized in isolation. Prompts tuned without knowledge of the inference strategy (for example best-of-N or majority voting) systematically underperform, while jointly optimizing the prompt and the inference method yields improvements of up to 50% (see: Does prompt optimization without inference strategy fail?). The lesson across the corpus: a compute budget is well spent only when difficulty estimation, model training, and prompt design are all aware of each other.
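The difference between tuning a prompt in isolation and tuning it jointly with the inference strategy can be sketched as two search procedures over the same scoring function. This is an illustrative sketch only: `joint_best`, `decoupled_best`, the prompt and strategy names, and the score table are all hypothetical, and the numbers are made up to show the mechanism rather than to report any result.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Evaluate = Callable[[str, str], float]


def joint_best(
    prompts: Sequence[str], strategies: Sequence[str], evaluate: Evaluate
) -> Tuple[str, str]:
    """Search every (prompt, strategy) pair and keep the best-scoring one."""
    return max(product(prompts, strategies), key=lambda pair: evaluate(*pair))


def decoupled_best(
    prompts: Sequence[str],
    strategies: Sequence[str],
    evaluate: Evaluate,
    tuning_strategy: str,
) -> Tuple[str, str]:
    """Pick the prompt under one fixed strategy, then pick a strategy for it."""
    best_prompt = max(prompts, key=lambda p: evaluate(p, tuning_strategy))
    best_strategy = max(strategies, key=lambda s: evaluate(best_prompt, s))
    return best_prompt, best_strategy


# Made-up scores: the prompt that wins with a single sample is not the one
# that benefits most from majority voting.
toy_scores = {
    ("terse", "single"): 0.60,
    ("terse", "majority_vote"): 0.62,
    ("diverse", "single"): 0.55,
    ("diverse", "majority_vote"): 0.70,
}


def evaluate(prompt: str, strategy: str) -> float:
    return toy_scores[(prompt, strategy)]


prompts = ["terse", "diverse"]
strategies = ["single", "majority_vote"]
joint = joint_best(prompts, strategies, evaluate)
decoupled = decoupled_best(prompts, strategies, evaluate, tuning_strategy="single")
```

In this toy table the decoupled search locks in the prompt that looks best under single sampling and never finds the pairing that the joint search reaches, which is the failure mode the paragraph above describes.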

Sources (8 notes)

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does prompt optimization without inference strategy fail?

Prompts optimized without knowledge of the inference strategy (best-of-N, majority voting) systematically underperform. Joint optimization of both prompt and inference strategy yields up to 50% improvement across reasoning and generation tasks.
