How should inference budget adapt based on problem difficulty?

This explores how a model should spend more or less compute at inference time depending on whether a problem is easy or hard — and the corpus reveals that knowing the difficulty turns out to be harder than acting on it.

The clean answer the corpus starts with: don't spend uniformly. Giving every prompt the same token budget wastes compute on easy questions and starves hard ones. Reallocating that same total budget adaptively (less for easy, more for hard) beats simply running a bigger model under a flat budget (Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?). So far, so intuitive. The interesting part is everything standing between that principle and making it work.
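As a rough illustration of reallocating a fixed total, the sketch below splits one token budget across prompts in proportion to a difficulty score. The function name `allocate_budget`, the per-prompt `floor`, and the proportional rule are all assumptions made for this example; the source notes do not specify an allocation rule, and they go on to show that obtaining a trustworthy difficulty score is the hard part.

```python
from fractions import Fraction
from typing import List


def allocate_budget(difficulties: List[float], total_tokens: int, floor: int = 16) -> List[int]:
    """Split total_tokens across prompts in proportion to difficulty.

    Hypothetical rule for illustration: every prompt gets `floor` tokens,
    and the remainder is shared in proportion to its difficulty score.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if any(d < 0 for d in difficulties):
        raise ValueError("difficulty scores must be non-negative")
    if total_tokens < floor * n:
        raise ValueError("total_tokens is too small for the per-prompt floor")

    spare = total_tokens - floor * n
    weights = [Fraction(d) for d in difficulties]  # exact arithmetic, no float drift
    weight_sum = sum(weights)
    if weight_sum == 0:
        shares = [Fraction(spare, n)] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * w / weight_sum for w in weights]

    budgets = [floor + int(s) for s in shares]  # int() truncates the fractional part

    # Hand out the tokens lost to truncation, largest fractional part first.
    leftover = total_tokens - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Same total as a flat 1000 // 3 split, but skewed toward the hardest prompt.
print(allocate_budget([1, 3, 6], total_tokens=1000, floor=100))
```

The total spend is unchanged from the uniform case; only its distribution moves, which is the comparison the first two source notes make.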

The first complication is that 'spend more on hard problems' assumes the model can tell what's hard. It mostly can: linear probes can decode a problem's difficulty straight out of a model's hidden states *before* it starts reasoning. Yet models still overthink simple questions. That's the striking finding: the failure is not one of perception but of action commitment, where the model senses the question is trivial and grinds through a long chain regardless (Can models recognize question difficulty before they reason?). You can't read difficulty off the output either, because a longer reasoning trace doesn't reliably mean a harder problem. In controlled A* maze experiments, trace length tracks difficulty only when the problem resembles training data and decouples completely out of distribution; length reflects recall of familiar schemas more than genuine adaptive computation (Does longer reasoning actually mean harder problems?).
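To make 'linear probe' concrete, here is a minimal sketch of the general technique: a logistic-regression classifier trained to separate easy from hard examples using only a linear readout of a vector. Everything here is an assumption for illustration. The vectors are synthetic Gaussian noise with a planted direction, not real model activations, and the helper names (`make_synthetic_states`, `train_probe`, `probe_score`) are hypothetical rather than taken from the cited work.

```python
import math
import random
from typing import List, Tuple


def make_synthetic_states(n: int, dim: int, seed: int) -> Tuple[List[List[float]], List[int]]:
    """Stand-in for hidden states: noise shifted along one fixed direction by the label."""
    rng = random.Random(seed)
    direction = [1.0 / math.sqrt(dim)] * dim  # assumed unit-length 'difficulty direction'
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = hard, 0 = easy
        shift = 2.0 if y == 1 else -2.0
        states.append([rng.gauss(0.0, 1.0) + shift * d for d in direction])
        labels.append(y)
    return states, labels


def probe_score(w: List[float], b: float, x: List[float]) -> float:
    """Probability that x is 'hard' under the linear probe (w, b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(states, labels, epochs: int = 200, lr: float = 0.1):
    """Full-batch gradient descent on the logistic loss."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = probe_score(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, states, labels) -> float:
    hits = sum((probe_score(w, b, x) >= 0.5) == (y == 1) for x, y in zip(states, labels))
    return hits / len(labels)


train_x, train_y = make_synthetic_states(200, 8, seed=0)
test_x, test_y = make_synthetic_states(200, 8, seed=1)
w, b = train_probe(train_x, train_y)
print(f"held-out probe accuracy: {accuracy(w, b, test_x, test_y):.2f}")
```

High probe accuracy only shows that the information is linearly readable from the representation. It says nothing about whether the model uses that information, which is the gap the paragraph above describes.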

That reframes what 'difficulty' even means for budgeting. Reasoning breakdowns aren't triggered by crossing some complexity threshold; they're triggered by *unfamiliarity*. Models fit instance-level patterns rather than general algorithms, so a long, intricate problem they've seen variants of can be easy, while a short novel one is hard (Do language models fail at reasoning due to complexity or novelty?). So the right budgeting signal isn't 'how big is this problem' but 'how far is this from what I know.' And more budget isn't a cure-all: on constrained-optimization tasks, models converge to roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime, and reasoning models do not systematically beat standard ones. That suggests some hard problems should get *less* budget, not more, because the spend is unlikely to convert (Do larger language models solve constrained optimization better?).

There's also an upper bound on the 'more thinking is better' instinct. Accuracy versus chain-of-thought length follows an inverted U: it peaks at an intermediate length and then declines. The optimal length rises with difficulty but *falls* as the model gets more capable, so stronger models need shorter chains for the same problem (Why does chain of thought accuracy eventually decline with length?). The adaptive policy is therefore two-dimensional: budget should scale up with problem hardness and down with model capability, rather than treating longer chains as always better.
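A toy version of that two-dimensional policy might look like the sketch below. The functional form, the constants, and the name `target_chain_length` are all hypothetical. The source notes give only the directions of the trend (up with difficulty, down with capability), not a formula, so this shows the shape of such a policy rather than a fitted one.

```python
def target_chain_length(
    difficulty: float,
    capability: float,
    base_tokens: int = 256,
    max_tokens: int = 4096,
) -> int:
    """Hypothetical target reasoning length in tokens.

    Both inputs are assumed to be normalized to [0, 1]. The target grows
    with difficulty and shrinks with capability; the multipliers are
    arbitrary illustration values, not measured ones.
    """
    if not (0.0 <= difficulty <= 1.0 and 0.0 <= capability <= 1.0):
        raise ValueError("difficulty and capability must lie in [0, 1]")
    raw = base_tokens * (1.0 + 4.0 * difficulty) / (1.0 + capability)
    return max(1, min(max_tokens, round(raw)))


# Harder problem, same model: longer target.
print(target_chain_length(0.2, 0.5), target_chain_length(0.8, 0.5))
# Same problem, stronger model: shorter target.
print(target_chain_length(0.5, 0.1), target_chain_length(0.5, 0.9))
```

Treating the result as a target rather than a minimum matters here: under an inverted-U curve, overshooting the optimum costs accuracy as well as compute.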

The most promising direction in the corpus stops treating the budget as something dialed externally and lets the model route itself. Thinkless trains a single model to choose between extended reasoning and a direct answer using decoupled RL (DeGRPO), learning self-calibrated routing without ever being handed difficulty labels. That targets exactly the action-commitment gap the probing work exposed, and closes it through training rather than a heuristic (Can models learn when to think versus respond quickly?). There is a reason this is a training problem at root: the productivity of extra inference tokens is itself installed during training. Non-reasoning models never catch up to reasoning models no matter how much inference budget you give them, because the training regime is what makes additional tokens pay off (Can non-reasoning models catch up with more compute?). The upshot: adaptive inference budgeting isn't really an inference-time knob. Whether a model can spend its budget well, and whether it acts on the difficulty it already perceives, are both decided long before the prompt arrives.

Sources 9 notes

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can models recognize question difficulty before they reason?

Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
