How do reward models benefit from extended thinking during evaluation scoring?
This explores whether letting reward models 'think' — produce reasoning before they score — actually makes their judgments better, and what the corpus says about when that extra thinking helps versus backfires.
The short answer from the corpus is yes, with sharp caveats about how and how much: letting a reward model reason before it scores, rather than emit a single number, improves the quality of its evaluations. Three independent teams (RRM, RM-R1, DeepSeek-GRM) converged on the same discovery: adding a chain-of-thought before the reward score turns evaluation into something you can scale at test time, raising the capability ceiling of the reward model beyond what a one-shot scalar judge can reach (see "Can reward models benefit from reasoning before scoring?"). The interesting part isn't that reasoning helps. It's that evaluation, long treated as a fixed classifier step, turns out to be a reasoning problem you can pour more compute into.
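One way to picture "evaluation you can scale at test time" is a judge that writes its reasoning first and its score last, called several times with the results combined. The sketch below is only an illustration: the `Judge` signature, the name `scaled_reward`, and the choice to average scores are assumptions made here, not the aggregation any of the three teams is stated to use.

```python
from statistics import mean
from typing import Callable, Tuple

# Hypothetical judge signature: takes (prompt, response) and returns
# (reasoning_text, score), with the reasoning produced before the score.
Judge = Callable[[str, str], Tuple[str, float]]


def scaled_reward(judge: Judge, prompt: str, response: str, k: int = 4) -> float:
    """Average the scores from k independent reasoning passes.

    Raising k spends more test-time compute on the same evaluation.
    Averaging is an assumed aggregation rule chosen for simplicity.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    scores = []
    for _ in range(k):
        _reasoning, score = judge(prompt, response)
        scores.append(score)
    return mean(scores)
```

With a real sampling-based judge, `k=1` corresponds to a single reasoning pass and larger `k` trades compute for a steadier score.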
The mechanism becomes clearer when you compare generative judges to old-style classifiers. Instead of mapping an output to a quality label, a generative judge writes out reasoning *about* the reasoning it's grading, and that meta-reasoning beats discriminative reward models, often with orders of magnitude less training data (see "Can judges that reason about reasoning outperform classifier rewards?"). Why? A scalar reward collapses everything into 'how good,' but feedback actually carries two separable things: an evaluative signal (how well it did) and a directive one (what to change). Reasoning gives the model room to recover the directive part a single number throws away (see "Can scalar rewards capture all the information in agent feedback?"). The same gap shows up when natural-language critiques break performance plateaus that numerical rewards alone can't: the words explain *why* a solution failed, which a scalar can't encode (see "Can natural language feedback overcome numerical reward plateaus?").
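The evaluative/directive split can be made concrete with a small data structure. This is a sketch under assumed names (`Feedback`, `to_scalar`, and the example directives are all hypothetical); it only shows that two pieces of feedback with different advice become indistinguishable once reduced to a number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    score: float     # evaluative signal: how well the output did
    directive: str   # directive signal: what to change


def to_scalar(feedback: Feedback) -> float:
    """Collapse feedback into a scalar reward; the directive is discarded."""
    return feedback.score


a = Feedback(0.3, "handle the empty-list case")
b = Feedback(0.3, "fix the off-by-one in the loop bound")

# Different advice, identical scalar reward.
assert to_scalar(a) == to_scalar(b)
assert a.directive != b.directive
```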
But here's the thing a curious reader might not expect: more thinking is not monotonically better, and the corpus is blunt about it. Reasoning accuracy peaks and then *declines* past a critical token threshold. One study watched accuracy fall from 87.3% to 70.3% as thinking stretched from ~1,100 to ~16K tokens, because models overthink easy cases and underthink hard ones (see "Does more thinking time always improve reasoning accuracy?"). Worse, some of what looks like 'better reasoning' may just be variance inflation: longer traces broaden the output distribution so it covers correct answers more often, which is a sampling-coverage effect rather than genuine improvement, and past a point the distribution gets too diffuse and accuracy drops (see "Does extended thinking actually improve reasoning or just increase variance?"). So the same caution applies to a reasoning reward model: extra deliberation can be coverage dressed up as judgment.
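A practical consequence is that a thinking budget should be chosen from measured accuracy rather than set as high as possible. The helper below is a minimal sketch with a hypothetical name (`pick_thinking_budget`); the only data it uses are the two points quoted in the notes, with the approximate token counts written as round numbers.

```python
from typing import Dict


def pick_thinking_budget(accuracy_by_budget: Dict[int, float]) -> int:
    """Return the token budget with the highest measured accuracy.

    Ties are broken in favour of the smaller (cheaper) budget.
    """
    if not accuracy_by_budget:
        raise ValueError("need at least one measured budget")
    return min(accuracy_by_budget, key=lambda b: (-accuracy_by_budget[b], b))


# The two measurements quoted above: ~1,100 tokens and ~16K tokens.
measured = {1_100: 0.873, 16_000: 0.703}
print(pick_thinking_budget(measured))  # prints 1100
```

In practice the sweep would cover more budgets on a held-out set; two points are enough to show that the larger budget is not automatically the better one.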
There's also a question of *whether* the thinking is productive at all, which depends on training. Vanilla models use thinking mode counterproductively (it induces self-doubt that degrades performance), and only RL training flips that same mechanism into useful gap analysis (see "Does extended thinking help or hurt model reasoning?"). That's a strong hint that a reward model doesn't get smarter just by being told to think; it has to be trained so the thinking does evaluative work. A related design lesson: how you *use* the reasoning matters. Treating rubrics as gates that accept or reject rollouts prevents reward hacking better than converting rubric scores into dense rewards. Structure in the evaluation beats more numeric signal (see "Can rubrics and dense rewards work together without hacking?").
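The gate-versus-dense-reward distinction can be sketched as follows. All names here (`gated_rewards`, `passes_rubric`, `dense_reward`) are hypothetical, and the acceptance rule is an assumption: the notes say rubrics accept or reject rollout groups but do not spell out the criterion, so this version keeps a group only when every rollout passes.

```python
from typing import Callable, List, Optional, Sequence


def gated_rewards(
    group: Sequence[str],
    passes_rubric: Callable[[str], bool],
    dense_reward: Callable[[str], float],
) -> Optional[List[float]]:
    """Use the rubric as a categorical gate, not as a reward term.

    Returns None when the group is rejected, so a failing rollout cannot
    be compensated by a high dense score. Accepted groups are scored by
    the dense reward alone.
    """
    # Assumed rule: reject the whole group if any rollout fails the rubric.
    if not all(passes_rubric(rollout) for rollout in group):
        return None
    return [dense_reward(rollout) for rollout in group]
```

The design point is that the rubric verdict never enters the numeric reward, so there is no rubric score for the policy to inflate.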
If you want to wander further afield, the corpus has two adjacent doorways. One: models can be trained to internalize evaluation entirely, learning to compute their own reward in the unused space after their output, which amounts to reasoning-as-self-judgment with zero inference cost (see "Can models learn to evaluate their own work during training?"). Two: a cautionary note on where richer reward models can go wrong. Personalizing them per user strips away the averaging effect of aggregate models and amplifies sycophancy and echo chambers (see "Does personalizing reward models amplify user echo chambers?"). Together the corpus suggests reasoning genuinely raises the ceiling on reward evaluation, but only when the model is trained to reason well, the thinking is kept short of its overthinking threshold, and the structure of the judgment is designed to resist gaming.
Sources (10 notes)
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.