Reinforcement Learning for LLMs

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams simultaneously converged on the same answer: yes.

Note · 2026-02-22 · sourced from Reward Models
How should we allocate compute budget at inference time?

Test-time compute scaling has been studied extensively for generation — but three independent research teams have simultaneously discovered it applies equally to evaluation. Reward Reasoning Models (RRMs), RM-R1, and DeepSeek-GRM all converge on the same insight: reward modeling is a reasoning task, and allowing the evaluator to "think" before scoring produces better rewards.

RRMs (2025) use RL to foster self-evolved reward reasoning without requiring explicit reasoning traces as training data. The model generates a chain-of-thought reasoning process before producing final rewards, adaptively allocating compute to queries where appropriate rewards are not immediately apparent. Multi-response strategies (ELO rating, knockout tournament) enable flexible test-time compute scaling. Crucially, RRMs develop distinct reasoning patterns from untrained foundation models — the training successfully reshapes how the model approaches evaluation.
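
A minimal sketch of the knockout strategy under stated assumptions: `rrm_compare` is a hypothetical stand-in for one RRM inference call that reasons over two candidate responses and emits a preference.

```python
import random

def rrm_compare(query: str, resp_a: str, resp_b: str) -> int:
    """Hypothetical stand-in for an RRM call: the model reasons in a
    chain-of-thought, then states which response it prefers (0 or 1)."""
    raise NotImplementedError("replace with an actual RRM inference call")

def knockout_tournament(query: str, responses: list[str]) -> str:
    """Select the best of N responses with N-1 pairwise RRM judgments.

    Each round halves the candidate pool; total compute scales with the
    number of responses sampled, which is the test-time scaling knob."""
    pool = list(responses)
    random.shuffle(pool)            # avoid bias from sampling order
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if rrm_compare(query, a, b) == 0 else b)
        if len(pool) % 2 == 1:      # odd candidate gets a bye to the next round
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```

The ELO-style variant trades the bracket for round-robin pairwise comparisons, which costs more judgments but yields a full ranking rather than a single winner.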

RM-R1 introduces Chain-of-Rubrics (CoR) — the model first categorizes the input as "chat" or "reasoning," then follows a different evaluation strategy for each. Chat tasks get self-generated rubrics, justifications, and evaluations; reasoning tasks get solve-first-then-evaluate. This task-type perception enables tailored reward generation. The training pipeline pairs reasoning-trace distillation with subsequent RLVR — distillation alone is insufficient, and RLVR alone fails to fully realize reasoning capabilities. Both stages are needed.
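
A rough sketch of what the CoR routing could look like as prompt construction; the template strings and the `build_cor_prompt` helper are illustrative paraphrases, not the paper's actual prompts.

```python
# Illustrative CoR-style templates: chat inputs get a rubric-first strategy,
# reasoning inputs get a solve-first strategy (wording is paraphrased).
CHAT_TEMPLATE = (
    "Task type: chat.\n"
    "1. Write evaluation rubrics for the query below.\n"
    "2. Justify how each candidate answer meets or misses each rubric.\n"
    "3. Output the better answer as <answer>A</answer> or <answer>B</answer>.\n\n"
    "Query: {query}\nAnswer A: {a}\nAnswer B: {b}"
)

REASONING_TEMPLATE = (
    "Task type: reasoning.\n"
    "1. Solve the problem yourself first.\n"
    "2. Compare each candidate answer against your own solution.\n"
    "3. Output the better answer as <answer>A</answer> or <answer>B</answer>.\n\n"
    "Problem: {query}\nAnswer A: {a}\nAnswer B: {b}"
)

def build_cor_prompt(task_type: str, query: str, a: str, b: str) -> str:
    """Route to the rubric-first or solve-first strategy based on the
    model's own classification of the input as chat vs. reasoning."""
    template = CHAT_TEMPLATE if task_type == "chat" else REASONING_TEMPLATE
    return template.format(query=query, a=a, b=b)
```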

DeepSeek-GRM uses Self-Principled Critique Tuning (SPCT) via rule-based online RL to generate principles adaptively per query-response pair, then critique against those principles. Parallel sampling generates diverse principle-critique sets, enabling finer-grained reward resolution with larger compute budgets. A meta RM further guides the voting process for better scaling performance.
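
A sketch of the voting scheme under stated assumptions: `sample_principles_and_critique` and `meta_rm_quality` are hypothetical stand-ins for the generative RM and meta-RM inference calls.

```python
def sample_principles_and_critique(query: str, responses: list[str]) -> list[int]:
    """Hypothetical GRM call: generate principles for this query, critique
    each response against them, and return one integer score per response."""
    raise NotImplementedError("replace with an actual generative RM call")

def meta_rm_quality(query: str, responses: list[str], scores: list[int]) -> float:
    """Hypothetical meta-RM call: rate how trustworthy one sampled
    principle-critique set is, used to filter samples before voting."""
    raise NotImplementedError("replace with an actual meta RM call")

def vote_rewards(query: str, responses: list[str], k: int = 8, top_m: int = 4) -> list[int]:
    """Inference-time scaling by voting: draw k independent principle-critique
    samples, keep the top_m judged best by the meta RM, and sum their scores.

    Summing per-response scores across samples widens the effective score
    range, which is what yields finer-grained reward resolution as k grows."""
    samples = [sample_principles_and_critique(query, responses) for _ in range(k)]
    ranked = sorted(samples, key=lambda s: meta_rm_quality(query, responses, s), reverse=True)
    kept = ranked[:top_m]
    return [sum(scores[i] for scores in kept) for i in range(len(responses))]
```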

The convergence matters because it identifies a bottleneck that was hiding in plain sight: the evaluator's capability ceiling constrains the entire alignment pipeline. As argued in "Does the choice of RL algorithm actually matter for reasoning?", the prior-bounded ceiling applies to reward models too — but reasoning-enabled reward models raise that ceiling by allocating compute adaptively.


Source: Reward Models — Reward Reasoning Model (arXiv:2505.14674), RM-R1 (arXiv:2505.02387), Inference-Time Scaling for Generalist Reward Modeling (arXiv:2504.02495)

