INQUIRING LINE

What distinguishes generative reward models from outcome-based and process-based approaches?

This explores how 'generative' reward models — ones that reason in words before scoring — differ from the older split between outcome rewards (judge only the final answer) and process rewards (judge each step), and why that difference matters.


The cleanest way to read the distinction in the corpus: outcome and process are about *what* you score (only the final answer, or each intermediate step), while 'generative' is about *how* the scorer arrives at its verdict. A generative reward model writes out chain-of-thought reasoning about the answer or the step before committing to a score, instead of acting as a silent classifier that maps an input straight to a number.

That shift matters more than it sounds. Three independent teams (RRM, RM-R1, DeepSeek-GRM) found that letting a reward model reason before scoring unlocks test-time compute scaling for *evaluation itself*: you can spend more thinking to get a better judgment, which raises the ceiling of what the reward model can reliably grade beyond what flat outcome scoring achieves (see: Can reward models benefit from reasoning before scoring?). The same pattern shows up at the step level. Judges trained to produce a reasoning chain *about* the policy's reasoning beat classifier-style step graders, and do it with far less labeled data: a 1.5B GenPRM beats GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers using only 1% of the PRM800K labels (see: Can generative reasoning beat discriminative models with less training data?; Can judges that reason about reasoning outperform classifier rewards?). So 'generative vs. discriminative' is a real axis, and it cuts *across* the outcome/process axis rather than replacing it.
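
A minimal sketch of what spending more compute on evaluation can look like, assuming the simplest possible aggregation: sample several independent reason-then-score verdicts and average them. The names `toy_judge` and `judge_with_budget` are hypothetical, the toy judge is only a noisy stand-in for a real generative reward model, and this is not the procedure used by RRM, RM-R1, or DeepSeek-GRM.

```python
import random
from statistics import mean
from typing import Callable, Tuple

# Hypothetical judge signature: (prompt, answer, rng) -> (written reasoning, score in [0, 1]).
Judge = Callable[[str, str, random.Random], Tuple[str, float]]


def toy_judge(prompt: str, answer: str, rng: random.Random) -> Tuple[str, float]:
    """Stand-in for a generative reward model: a noisy verdict around a hidden true quality."""
    true_quality = 1.0 if answer.strip() == "4" else 0.0  # assumption: the right answer is "4"
    score = min(1.0, max(0.0, true_quality + rng.gauss(0.0, 0.4)))
    reasoning = f"Checked the arithmetic for {prompt!r}; leaning {score:.2f}."
    return reasoning, score


def judge_with_budget(judge: Judge, prompt: str, answer: str, k: int, seed: int = 0) -> float:
    """Sample k reason-then-score verdicts and return their mean score."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    return mean(judge(prompt, answer, rng)[1] for _ in range(k))


if __name__ == "__main__":
    for k in (1, 8, 64):
        print(k, judge_with_budget(toy_judge, "2 + 2 = ?", "4", k))
```

Averaging is only one design choice; majority voting over discrete verdicts or a learned aggregator would fit the same interface.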

The more interesting tension is between outcome and process supervision themselves, and here the corpus suggests you don't always have to choose. Tree-search rollouts can manufacture step-level (process) signal out of pure outcome rewards: by branching trajectories and comparing sibling subtrees, Tree-GRPO derives step-wise preferences with no separate process model or step annotation, and the signal scales with the compute budget (see: Can tree structure alone convert outcome rewards into process supervision?). That's a quiet rebuke to the assumption that process supervision needs expensive per-step labels.
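
The sibling comparison can be sketched as follows, assuming each subtree is valued by the mean outcome reward of its leaves and a preference is emitted whenever two siblings differ. `Node`, `subtree_value`, and `sibling_preferences` are hypothetical names, and this illustrates the general idea rather than the Tree-GRPO algorithm itself.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str
    reward: Optional[float] = None  # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf under this node."""
    if not node.children:
        return [node.reward if node.reward is not None else 0.0]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def subtree_value(node: Node) -> float:
    """Assumed value estimate: mean outcome reward over the subtree's leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def sibling_preferences(node: Node) -> List[Tuple[str, str]]:
    """Return (preferred_step, rejected_step) pairs from sibling comparisons."""
    prefs: List[Tuple[str, str]] = []
    for a, b in combinations(node.children, 2):
        va, vb = subtree_value(a), subtree_value(b)
        if va > vb:
            prefs.append((a.step, b.step))
        elif vb > va:
            prefs.append((b.step, a.step))  # ties produce no preference
    for child in node.children:
        prefs.extend(sibling_preferences(child))
    return prefs


if __name__ == "__main__":
    tree = Node("root", children=[
        Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
        Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
    ])
    print(sibling_preferences(tree))
```

Only leaf rewards are supplied here; every step-level preference is derived from the tree structure.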

There's also a deeper point about what a scalar reward can even capture. One line of work argues that natural feedback splits into two orthogonal channels, *evaluative* (how good was this?) and *directive* (how should it change?), and that a single number, whether outcome or process, throws the directive part away (see: Can scalar rewards capture all the information in agent feedback?). Generative judges that write critiques are partly recovering that lost channel: a verbal verdict carries directional information a scalar can't. Relatedly, work on rubrics (DRO) shows it's often better to use a structured signal as a *gate* that accepts or rejects a rollout group than to flatten it into a dense reward, which invites reward hacking (see: Can rubrics and dense rewards work together without hacking?).
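
To make the gate-versus-dense-reward contrast concrete, here is a small sketch in which a rubric only decides whether a rollout group is kept and never contributes to the reward value. The acceptance rule (a minimum fraction of rollouts passing every rubric check) and the names `passes_rubric` and `gate_groups` are assumptions for illustration, not DRO's actual criterion.

```python
from typing import Callable, List, Sequence

# A rubric is modeled as a list of boolean checks on a rollout's text (an assumption).
Rubric = Sequence[Callable[[str], bool]]


def passes_rubric(text: str, rubric: Rubric) -> bool:
    """A rollout passes only if it satisfies every rubric check."""
    return all(check(text) for check in rubric)


def gate_groups(
    groups: List[List[str]], rubric: Rubric, min_pass_fraction: float = 0.5
) -> List[List[str]]:
    """Keep whole rollout groups that clear the gate; drop the rest.

    The rubric acts as accept/reject only. Any finer-grained reward would be
    computed separately, on the groups that survive.
    """
    kept: List[List[str]] = []
    for group in groups:
        if not group:
            continue  # an empty group carries no signal
        passed = sum(passes_rubric(rollout, rubric) for rollout in group)
        if passed / len(group) >= min_pass_fraction:
            kept.append(group)
    return kept


if __name__ == "__main__":
    rubric = [lambda t: "because" in t, lambda t: len(t) < 80]
    groups = [["4 because 2 + 2 = 4", "4"], ["5", "6"]]
    print(gate_groups(groups, rubric))
```

Because the rubric never enters the reward as a number, a policy cannot raise its reward by inflating rubric scores; it can only clear the gate or fail it.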

The thing worth carrying away: the field is drifting from reward models as silent scoring functions toward reward models as *reasoners and critics*. Scalar outcome rewards even have measurable pathologies: binary correctness rewards hurt calibration because they never penalize confident wrong guesses, so high-confidence guessing pays (see: Does binary reward training hurt model calibration?). That is part of why richer, generative, reasoning-based evaluation keeps gaining ground. If you want to go deeper on any one thread, the test-time-compute angle and the data-efficiency angle are the two most surprising doorways.
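
The calibration point can be checked numerically. The sketch below assumes the combined reward takes the form correctness minus the squared gap between stated confidence and correctness (a Brier penalty); that exact form and all function names are assumptions made for illustration. Under a purely binary reward, expected reward does not depend on stated confidence, so overconfidence costs nothing; with the Brier term, expected reward peaks when stated confidence equals the true chance of being right.

```python
from typing import Callable

RewardFn = Callable[[bool, float], float]


def binary_reward(correct: bool, confidence: float) -> float:
    """Correctness only; stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed combined form: correctness minus squared confidence error."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(reward_fn: RewardFn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn: RewardFn, p_correct: float, grid: int = 101) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(reward_fn, p_correct, q))


if __name__ == "__main__":
    p = 0.3
    print("binary, confident vs honest:",
          expected_reward(binary_reward, p, 1.0),
          expected_reward(binary_reward, p, p))
    print("best confidence with Brier term:", best_confidence(brier_augmented_reward, p))
```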


Sources (7 notes)

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can generative reasoning beat discriminative models with less training data?

GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Next inquiring lines