Can inference budgets be allocated differently based on prompt difficulty?

This explores whether an LLM can spend more compute on hard prompts and less on easy ones — and whether that adaptive allocation beats spending the same fixed amount on every prompt.

The corpus answers yes, and clearly. The effectiveness of extra inference compute varies dramatically with how hard a prompt is, so giving easy prompts less and hard prompts more, while keeping the total budget fixed, substantially outperforms uniform spending and can even beat a larger model run under a flat budget (see "Can we allocate inference compute based on prompt difficulty?" and "How should we allocate compute budget at inference time?"). The intuition is that uniform budgets waste tokens on problems already solved while starving the ones that actually need the thinking.
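The fixed-total reallocation described above can be sketched in a few lines. This is a minimal illustration, not a method taken from the cited notes: the function name `allocate_budget`, the idea of a per-prompt difficulty score, and the proportional rule are all assumptions made for the example.

```python
def allocate_budget(difficulties, total_budget, min_per_prompt=1):
    """Split a fixed sample budget across prompts in proportion to difficulty.

    `difficulties` are hypothetical non-negative scores (higher = harder).
    The proportional rule is an assumption chosen for illustration.
    """
    n = len(difficulties)
    if total_budget < n * min_per_prompt:
        raise ValueError("budget too small to give every prompt the minimum")
    # Every prompt gets the floor; the remainder is shared by difficulty.
    spare = total_budget - n * min_per_prompt
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / total_difficulty for d in difficulties]
    allocation = [min_per_prompt + int(s) for s in shares]
    # Hand out samples lost to rounding, largest fractional part first.
    leftover = total_budget - sum(allocation)
    by_fraction = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_fraction[:leftover]:
        allocation[i] += 1
    return allocation


# Two easy prompts and one hard prompt share 12 samples in total.
print(allocate_budget([1, 1, 8], total_budget=12))  # [2, 2, 8]
```

A uniform policy would give each of the three prompts 4 samples. The sketch spends the same 12 but moves most of them to the hard prompt.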

The more interesting question is *who decides* the budget. One approach learns the routing directly: rather than relying on hand-labeled difficulty, a model can be trained to choose between extended step-by-step thinking and a quick direct answer, calibrating itself to the prompt. Thinkless does this with DeGRPO, a decoupled reinforcement learning scheme that separates the 'how much to think' decision from the 'what's the answer' decision, which keeps the model from collapsing into always-think or never-think (see "Can models learn when to think versus respond quickly?"). That reframes difficulty-based allocation as something the model can internalize, not just a knob an external scheduler turns.
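To make the idea of decoupling concrete, here is a toy sketch. It is not the Thinkless implementation: it assumes, purely for illustration, that the mode decision is a single token with its own loss and that "decoupled" means normalizing that term separately from the answer tokens. The names `pooled_loss`, `decoupled_loss`, and `mode_weight` are hypothetical.

```python
def pooled_loss(mode_token_loss, response_token_losses):
    """Average every token together: the single mode token is diluted
    as the response gets longer."""
    all_losses = [mode_token_loss] + list(response_token_losses)
    return sum(all_losses) / len(all_losses)


def decoupled_loss(mode_token_loss, response_token_losses, mode_weight=1.0):
    """Normalize the answer tokens on their own, then add the mode term
    with an explicit weight (an assumed hyperparameter)."""
    count = max(len(response_token_losses), 1)
    response_term = sum(response_token_losses) / count
    return mode_weight * mode_token_loss + response_term


short_answer = [1.0] * 10
long_answer = [1.0] * 1000

# Share of the pooled loss that comes from the mode token shrinks with length.
print(2.0 / (2.0 + sum(short_answer)), 2.0 / (2.0 + sum(long_answer)))

# In the decoupled form the mode term has the same size for both lengths.
print(decoupled_loss(2.0, short_answer), decoupled_loss(2.0, long_answer))
```

Under these assumptions, the mode-selection signal no longer depends on how long the chosen response happens to be, which is one plausible reading of why separating the two decisions helps.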

There's an important boundary, though: extra inference budget only pays off if the model was trained to use it. Reasoning models persistently beat non-reasoning models no matter how much inference compute you throw at the weaker model, because training instills a protocol that makes additional tokens productive; without it, more tokens are just more tokens (see "Can non-reasoning models catch up with more compute?"). So 'allocate more to hard prompts' presupposes a model that can convert that allowance into better answers. Relatedly, the corpus shows reasoning effort isn't universally good even within a capable model: for some simple questions, forcing step-by-step reasoning actually *hurts*, and a direct question-to-answer path wins (see "Why do some questions perform better without step-by-step reasoning?"). Difficulty allocation, then, isn't only about the quantity of compute; it's about matching the *mode* of reasoning to the prompt.

A subtler thread is how you'd even estimate difficulty at inference time. Model confidence turns out to be a usable proxy: highly confident models resist prompt rephrasing and produce stable outputs, while low confidence signals fragility, which flags a prompt where extra compute or alternative strategies might help (see "Does model confidence predict robustness to prompt changes?"). And the choice of inference strategy can't be decoupled from the prompt itself: prompts optimized in ignorance of the inference method (best-of-N, majority voting) systematically underperform, while jointly optimizing prompt and inference strategy yields large gains (see "Does prompt optimization without inference strategy fail?").
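One cheap way to turn confidence into a difficulty signal is to sample a few answers and measure how often they agree. This is a swapped-in proxy chosen for illustration, not the confidence measure used in the cited work. The function names and the 0.7 threshold are assumptions.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return the majority answer and the fraction of samples that match it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)


def needs_more_compute(answers, threshold=0.7):
    """Flag a prompt as hard when the samples disagree too much.

    The threshold is a hypothetical tuning knob, not a recommended value.
    """
    _, confidence = agreement_confidence(answers)
    return confidence < threshold


print(agreement_confidence(["42", "42", "41", "42"]))  # ('42', 0.75)
print(needs_more_compute(["a", "b", "c", "a"]))        # True
```

The same samples double as a majority vote, so the probe is not wasted: a prompt that passes the threshold already has its answer, and one that fails is a candidate for a larger budget.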

A less obvious point: budget allocation isn't just a runtime scheduling problem. It reaches back into architecture and training. Folding architectural variables (hidden size, MLP-to-attention ratio, GQA configuration) into scaling laws lets you build models that deliver more inference throughput at the same or better accuracy (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"), and the reason adaptive allocation works at all traces to the finding that only a small minority of high-entropy 'forking' tokens carry the real reasoning decisions (see "Do high-entropy tokens drive reasoning model improvements?"). If most tokens are low-stakes filler and a few are pivotal, then spending compute where the forks are, and skipping where they aren't, is exactly what difficulty-aware budgeting is reaching for.
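The notion of a high-entropy forking token can be made concrete with a small sketch. It assumes you have the model's next-token probability distribution at each position. The function names are hypothetical, and the default of 0.2 simply mirrors the roughly 20% figure quoted in the source note.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(distributions, fraction=0.2):
    """Indices of the highest-entropy positions, returned in sequence order."""
    entropies = [token_entropy(d) for d in distributions]
    keep = max(1, round(len(entropies) * fraction))
    ranked = sorted(
        range(len(entropies)), key=lambda i: entropies[i], reverse=True
    )
    return sorted(ranked[:keep])


# Five positions: most are near-deterministic, one is a genuine fork.
distributions = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.9, 0.1],
    [0.99, 0.01],
    [1.0, 0.0],
]
print(forking_positions(distributions))  # [1]
```

Here position 1 is the only one where the model is truly undecided, so it is the one a difficulty-aware scheme would treat as worth extra compute.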

Sources 9 notes

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Does prompt optimization without inference strategy fail?

Prompts optimized without knowledge of the inference strategy (best-of-N, majority voting) systematically underperform. Joint optimization of both prompt and inference strategy yields up to 50% improvement across reasoning and generation tasks.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
