How do reward models guide inference-time compute allocation decisions?

This note explores the link between two ideas that usually live in separate corners of the corpus: reward models (which score model outputs) and inference-time compute allocation (deciding how much thinking to spend per prompt). It asks whether the former actually drives the latter.

The corpus suggests the connection is real but indirect: reward signals shape *where* effort goes more than they issue explicit budget orders. The cleanest direct link is that reward models have themselves become consumers of inference compute. Instead of scoring an answer in one pass, reasoning reward models think before they judge, and three independent teams (RRM, RM-R1, DeepSeek-GRM) found that this chain-of-thought-before-scoring raises the quality ceiling of evaluation itself (see "Can reward models benefit from reasoning before scoring?"). A parallel line shows that judges which *reason about reasoning*, producing critiques of each step rather than a single classifier verdict, outperform traditional discriminative reward models with far less training data (see "Can judges that reason about reasoning outperform classifier rewards?"). So the first answer to the question is almost recursive: better reward models are ones that spend more inference compute.
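One simple way to picture "evaluation that spends more compute" is to sample a reasoning judge several times and aggregate its scores. The sketch below is illustrative only: `judge_once` is a hypothetical callable standing in for a reasoning reward model, the `(critique, score)` return shape and the mean aggregation are assumptions, and none of this is claimed to be the procedure used by RRM, RM-R1, or DeepSeek-GRM.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical type: one judgment is (written critique, numeric score in [0, 1]).
Judgment = Tuple[str, float]


def judge_with_budget(
    prompt: str,
    answer: str,
    judge_once: Callable[[str, str], Judgment],
    k: int = 4,
) -> Tuple[float, List[str]]:
    """Call a reasoning judge k times and average its scores.

    k is the evaluation-side compute knob: a larger k costs more
    judge calls in exchange for a less noisy aggregate score.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    judgments = [judge_once(prompt, answer) for _ in range(k)]
    critiques = [critique for critique, _ in judgments]
    scores = [score for _, score in judgments]
    return mean(scores), critiques
```

Here `k` plays the role of an evaluation budget. Returning the critiques alongside the score keeps the judge's written reasoning available to whatever consumes the result.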

The more interesting answer is about allocation. The corpus is emphatic that uniform inference budgets waste resources, over-serving easy prompts and under-serving hard ones, and that reallocating the *same* total compute by difficulty beats simply using a bigger model under a uniform budget (see "Can we allocate inference compute based on prompt difficulty?" and "How should we allocate compute budget at inference time?"). The deeper question is how a system *knows* that a prompt is hard enough to deserve more compute. That difficulty signal is exactly what a reward or verifier model can supply: it scores partial work, and low scores justify spending more. This is why test-time compute can stand in for raw model size on hard prompts (see "Can inference compute replace scaling up model size?"): the scoring mechanism is what tells you when the substitution is worth making.
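The idea that low scores justify spending more can be sketched as a sequential sampling loop that stops as soon as a candidate scores well enough. This is a minimal sketch under stated assumptions, not a method taken from the cited notes: `generate` and `score` are hypothetical callables (a sampler and a reward or verifier model), and `accept_threshold` and `max_samples` are illustrative parameters.

```python
from typing import Callable, Tuple


def adaptive_best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical sampler: prompt -> candidate answer
    score: Callable[[str, str], float],  # hypothetical reward/verifier: (prompt, answer) -> score
    max_samples: int = 8,
    accept_threshold: float = 0.8,
) -> Tuple[str, float, int]:
    """Keep sampling until some candidate scores above the threshold
    or the per-prompt budget is exhausted.

    Returns (best answer, best score, number of samples used).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    best_answer, best_score = "", float("-inf")
    used = 0
    while used < max_samples:
        candidate = generate(prompt)
        used += 1
        candidate_score = score(prompt, candidate)
        if candidate_score > best_score:
            best_answer, best_score = candidate, candidate_score
        # A high score means the prompt was easy enough: stop spending.
        if best_score >= accept_threshold:
            break
    return best_answer, best_score, used
```

Under this sketch, a prompt whose first sample already clears the threshold costs one generation, while a prompt that keeps scoring low consumes the full `max_samples`. The scorer, not a difficulty label, determines the spend.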

There's a quieter route where models learn the allocation policy directly rather than consulting a separate scorer. Thinkless trains one model to route between extended reasoning and quick replies, using a decoupled RL setup (DeGRPO) that learns when to think without ever being handed explicit difficulty labels (see "Can models learn when to think versus respond quickly?"). Here the reward signal is baked into training and then expressed at inference as a routing decision: the allocation logic has been internalized. A related trick appears in pretraining. When models generate their own thinking traces, harder tokens automatically attract longer traces, a natural compute-allocation mechanism that mirrors test-time scaling (see "Can training data augmentation match test-time compute scaling benefits?").

The corpus also flags that reward signals have limits as allocation guides. RLVR (reinforcement learning with verifiable rewards) mostly *activates* strategies already latent in pretraining rather than teaching new ones: a single example or even spurious rewards can trigger the behavior, which means a reward model is steering existing capability, not creating it (see "What does reward learning actually do to model reasoning?"). Purely numerical rewards also hit plateaus because they say *whether* an answer failed but not *why*; natural-language critiques break through those plateaus by carrying information a scalar score can't (see "Can natural language feedback overcome numerical reward plateaus?"). That's a hint that the richest allocation signals are descriptive, not just numeric.

The payoff for a curious reader: 'spend more compute on hard problems' sounds obvious, but the corpus reframes the reward model from a passive grader into the thing that *defines difficulty*. It also shows the field splitting into two camps, one that consults an external reasoning judge at inference time and one that trains the allocation reflex directly into the model. The budget itself now has more than one axis: in agentic research, search iterations scale much like reasoning tokens, so a reward signal isn't only choosing how long to think but whether to think or to go look something up (see "Does search budget scale like reasoning tokens for answer quality?").
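The two-axis budget can be illustrated with a toy allocator that hands out budget one unit at a time to whichever axis currently offers the larger marginal gain. Everything here is hypothetical: the gain curves are made-up concave functions chosen only to mimic diminishing returns, one "unit" is assumed to cost the same on both axes, and in practice the gains would have to be estimated, for example from a reward signal.

```python
import math
from typing import Callable, Dict


def split_budget(
    total_units: int,
    gain_curves: Dict[str, Callable[[int], float]],
) -> Dict[str, int]:
    """Greedily assign each budget unit to the axis with the largest
    marginal gain. For concave (diminishing-returns) curves this greedy
    rule yields an optimal split."""
    spent = {name: 0 for name in gain_curves}
    for _ in range(total_units):
        def marginal(name: str) -> float:
            curve = gain_curves[name]
            return curve(spent[name] + 1) - curve(spent[name])

        best_axis = max(gain_curves, key=marginal)
        spent[best_axis] += 1
    return spent


# Hypothetical quality curves: both show diminishing returns, and
# search is assumed to help twice as much per unit on this prompt.
curves = {
    "reasoning": lambda n: math.log1p(n),
    "search": lambda n: 2.0 * math.log1p(n),
}
allocation = split_budget(3, curves)
```

With these made-up curves the first two units go to search and the third to reasoning, because each extra search step is worth less than the one before it.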

Sources (10 notes)

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency by 3x and reasoning benchmark performance by more than 10% for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
