Can reward models benefit from reasoning before scoring?
Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.
Test-time compute scaling has been studied extensively for generation — but three independent research teams have simultaneously discovered it applies equally to evaluation. Reward Reasoning Models (RRMs), RM-R1, and DeepSeek-GRM all converge on the same insight: reward modeling is a reasoning task, and allowing the evaluator to "think" before scoring produces better rewards.
RRMs (2025) use RL to foster self-evolved reward reasoning without requiring explicit reasoning traces as training data. The model generates a chain-of-thought reasoning process before producing final rewards, adaptively allocating compute to queries where appropriate rewards are not immediately apparent. Multi-response strategies (ELO rating, knockout tournament) enable flexible test-time compute scaling. Crucially, RRMs develop distinct reasoning patterns from untrained foundation models — the training successfully reshapes how the model approaches evaluation.
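A minimal sketch of the multi-response idea: given a pairwise judge that reasons before picking a winner, a knockout tournament selects the best of N sampled responses with N-1 comparisons, so evaluation compute scales with how many candidates you draw. The `judge` interface below is a hypothetical stand-in, not the RRM implementation.

```python
import random
from typing import Callable, List

# Hypothetical interface: a reasoning reward model that, given a query and two
# candidate responses, emits a chain-of-thought internally and returns the
# index (0 or 1) of the response it judges better.
PairwiseJudge = Callable[[str, str, str], int]

def knockout_tournament(query: str, responses: List[str], judge: PairwiseJudge) -> str:
    """Select the best of N responses with N-1 pairwise judgments.

    Each round pairs up the survivors and keeps the winner of each match,
    so test-time compute grows with the number of sampled responses.
    """
    pool = list(responses)
    random.shuffle(pool)  # avoid positional bias in the bracket
    while len(pool) > 1:
        next_round = []
        # Pair adjacent candidates; an odd one out advances automatically.
        for i in range(0, len(pool) - 1, 2):
            winner = pool[i] if judge(query, pool[i], pool[i + 1]) == 0 else pool[i + 1]
            next_round.append(winner)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

An ELO-style variant would instead run round-robin comparisons and rank candidates by accumulated wins, trading more judge calls for a full ordering rather than a single winner.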
RM-R1 introduces Chain-of-Rubrics (CoR): the model first categorizes the input as "chat" or "reasoning," then follows a different evaluation strategy for each. Chat tasks get self-generated rubrics, justifications, and evaluations; reasoning tasks get solve-first-then-evaluate. This task-type perception enables tailored reward generation. The training pipeline chains reasoning distillation into RLVR — distillation alone is insufficient, and RLVR alone fails to fully realize reasoning capabilities, so both stages are needed.
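A rough sketch of the CoR control flow, assuming a generic text-in/text-out `generate` call; the prompt wording is paraphrased for illustration and is not RM-R1's actual template.

```python
from typing import Callable

Generate = Callable[[str], str]  # hypothetical LLM call: prompt in, text out

def chain_of_rubrics_verdict(query: str, response_a: str, response_b: str,
                             generate: Generate) -> str:
    """Route evaluation by task type: chat tasks get self-generated rubrics,
    reasoning tasks get solve-first-then-evaluate."""
    task_type = generate(
        f"Classify the following query as 'chat' or 'reasoning'.\nQuery: {query}"
    ).strip().lower()

    if "reasoning" in task_type:
        # Solve first, then judge each candidate against the model's own solution.
        solution = generate(f"Solve the problem step by step.\nProblem: {query}")
        verdict = generate(
            "Using your solution as a reference, decide which response is better.\n"
            f"Problem: {query}\nYour solution: {solution}\n"
            f"Response A: {response_a}\nResponse B: {response_b}\n"
            "Answer with 'A' or 'B'."
        )
    else:
        # Chat: generate rubrics with justifications, then evaluate against them.
        rubrics = generate(
            f"Write evaluation rubrics (with justifications) for this query: {query}"
        )
        verdict = generate(
            "Evaluate both responses against the rubrics and pick the better one.\n"
            f"Rubrics: {rubrics}\nQuery: {query}\n"
            f"Response A: {response_a}\nResponse B: {response_b}\n"
            "Answer with 'A' or 'B'."
        )
    return "A" if verdict.strip().upper().startswith("A") else "B"
```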
DeepSeek-GRM uses Self-Principled Critique Tuning (SPCT) via rule-based online RL to generate principles adaptively for each query-response pair, then critique responses against those principles. Parallel sampling generates diverse principle-critique sets, enabling finer-grained reward resolution at larger compute budgets. A meta RM further guides the voting process for better scaling performance.
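A simplified sketch of the parallel-sampling-and-voting step, assuming a generative RM that returns one pointwise score per response each time it is sampled; the meta-RM weighting here is a hypothetical simplification of the guided voting described in the paper, and all interfaces are placeholders.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interfaces:
#   sample_grm(query, responses) -> one sampled principle-critique pass,
#       reduced here to a list of pointwise scores (one per response).
#   meta_rm(query, responses, sample_id) -> quality weight for that sample.
SampleGRM = Callable[[str, Sequence[str]], List[float]]
MetaRM = Callable[[str, Sequence[str], int], float]

def vote_rewards(query: str, responses: Sequence[str], sample_grm: SampleGRM,
                 k: int = 8, meta_rm: Optional[MetaRM] = None) -> List[float]:
    """Aggregate k independently sampled principle-critique scorings.

    More samples mean finer-grained reward resolution; an optional meta RM
    down-weights low-quality principle-critique sets before voting.
    """
    totals = [0.0] * len(responses)
    for i in range(k):
        scores = sample_grm(query, responses)            # one principle-critique pass
        weight = meta_rm(query, responses, i) if meta_rm else 1.0
        for j, s in enumerate(scores):
            totals[j] += weight * s
    return totals  # the response with the highest total wins
```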
The convergence matters because it identifies a bottleneck that was hiding in plain sight: the evaluator's capability ceiling constrains the entire alignment pipeline. As argued in "Does the choice of RL algorithm actually matter for reasoning?", the prior-bounded ceiling applies to reward models too — but reasoning-enabled reward models raise that ceiling by allocating compute adaptively.
Source: Reward Models — Reward Reasoning Model (arXiv 2505.14674), RM-R1 (arXiv 2505.02387), Inference-Time Scaling for Generalist Reward Modeling (arXiv 2504.02495)
Related concepts in this collection
- Can reasoning during evaluation reduce judgment bias in LLM judges?
  Can training language model judges to think through their evaluations, rather than pattern-matching on surface features, mitigate the four known biases that make them vulnerable to manipulation attacks?
  Relation: directly extends — J1 showed RL can train judges; RRM/RM-R1/SPCT show independent convergence on the approach.
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  Relation: reward evaluation becomes another adaptive-compute domain.
- Why do outcome-based reward models fail at intermediate step evaluation?
  Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
  Relation: generative reward models (RRM/RM-R1) add a third category to the ORM/PRM taxonomy: interpretable reasoning plus a final reward.
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
  Relation: the prior-bounded ceiling applies to reward models too; reasoning capability raises it.
- Why do self-improvement loops eventually stop improving?
  Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?
  Relation: reward reasoning models are a concrete mechanism for the evaluator co-evolution that Meta-Rewarding requires; adaptive test-time compute for evaluation means the judge can scale alongside the actor rather than remaining static.
- Do all AI skills improve equally as models scale?
  Different evaluation skills show strikingly different scaling patterns. Understanding where skills saturate has immediate implications for model deployment and capability requirements across domains.
  Relation: FLASK's differential scaling justifies the RRM approach; reasoning-based evaluation specifically invests compute in Logical Thinking skills (which scale with compute) rather than User Alignment skills (which saturate early), targeting the evaluation dimensions where additional reasoning traces provide the most improvement.
Original note title: reward reasoning models extend test-time compute scaling to reward evaluation by producing reasoning traces before scoring