Can generative reasoning improve process reward model efficiency?
Do process reward models that generate reasoning before judging outperform traditional discriminative approaches? This explores whether letting verifiers think—not just score—changes what test-time scaling can achieve.
Process reward models (PRMs) are central to test-time scaling but face three limitations: limited generalization across models and tasks, dependence on scalar value prediction that ignores LLMs' generative abilities, and inability to scale test-time verification compute. Two converging approaches address all three by reframing process supervision as a generative task.
GenPRM performs Chain-of-Thought reasoning and code verification before issuing a judgment on each reasoning step. Using Relative Progress Estimation (RPE), a relative criterion for label estimation rather than hard labels, together with a rationale synthesis framework backed by code verification, GenPRM achieves strong results with only 23K training examples from MATH. A 1.5B GenPRM outperforms GPT-4o on ProcessBench; a 7B version surpasses Qwen2.5-Math-PRM-72B.
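To make the labeling idea concrete, here is a minimal Python sketch of relative-progress soft labeling. The `success_rate` callable (a Monte Carlo estimate of reaching a correct answer from a prefix) and the zero-progress threshold are illustrative assumptions, not GenPRM's exact estimator or settings.

```python
from typing import Callable, Sequence

def rpe_labels(
    success_rate: Callable[[Sequence[str]], float],
    steps: Sequence[str],
    threshold: float = 0.0,
) -> list[int]:
    """Soft process labels from relative progress between consecutive prefixes.

    `success_rate(prefix)` is assumed to return a Monte Carlo estimate of the
    probability that completions from that prefix reach a correct final answer.
    A step is labeled correct (1) if it does not reduce that estimate by more
    than `threshold`, otherwise incorrect (0).
    """
    labels = []
    prev = success_rate([])  # estimate for the empty prefix (problem only)
    for i in range(1, len(steps) + 1):
        cur = success_rate(steps[:i])
        labels.append(1 if (cur - prev) >= -threshold else 0)  # relative, not hard, criterion
        prev = cur
    return labels
```

The contrast with hard labels: a hard-labeled PRM would mark a step by whether any rollout from it succeeds, whereas the relative criterion asks whether the step moved the estimated success probability forward.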
ThinkPRM capitalizes on the inherent reasoning abilities of long CoT models, fine-tuning with as few as 8K synthetic verification chains. Using only 1% of the process labels in PRM800K, ThinkPRM outperforms LLM-as-a-Judge and discriminative verifiers across ProcessBench, MATH-500, and AIME '24. In out-of-domain evaluation (GPQA-Diamond, LiveCodeBench), it surpasses discriminative PRMs trained on the full PRM800K by 8% and 4.5% respectively.
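A rough sketch of how a generative verifier in this style can be used at inference time: generate one long verification chain over the full solution, then parse per-step verdicts out of it. The prompt template, the `\boxed{correct}` convention, and the `generate` callable are hypothetical placeholders rather than ThinkPRM's exact format.

```python
import re

# Hypothetical prompt: ask the verifier to critique each step and close each
# critique with "Step i: \boxed{correct}" or "Step i: \boxed{incorrect}".
VERIFY_TEMPLATE = (
    "Problem:\n{problem}\n\nSolution steps:\n{steps}\n\n"
    "Verify each step. Think carefully, then end the critique of step i with "
    "'Step i: \\boxed{{correct}}' or 'Step i: \\boxed{{incorrect}}'."
)

def verify(problem: str, steps: list[str], generate) -> list[bool]:
    """Run one verification chain and parse per-step judgments.

    `generate(prompt)` stands in for any long-CoT model call that returns the
    verifier's full chain of thought as a string.
    """
    prompt = VERIFY_TEMPLATE.format(
        problem=problem,
        steps="\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps)),
    )
    chain = generate(prompt)
    verdicts = []
    for i in range(1, len(steps) + 1):
        m = re.search(rf"Step {i}:\s*\\boxed\{{(correct|incorrect)\}}", chain)
        verdicts.append(bool(m) and m.group(1) == "correct")
    return verdicts
```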
The key structural advantage: generative PRMs uniquely support simultaneous scaling of both generator and verifier compute. Discriminative PRMs output a fixed scalar; generative PRMs can be forced to think longer, producing more thorough verification. Under the same token budget, ThinkPRM scales verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on ProcessBench.
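As a sketch of what scaling verifier compute can look like in the parallel direction, one simple recipe is to sample several independent verification chains and aggregate their verdicts. The aggregation rule below (averaging an all-steps-correct vote) is an illustrative choice, reusing the hypothetical `verify` helper sketched above.

```python
def parallel_verify(problem, steps, generate, n_chains: int = 8) -> float:
    """Aggregate several independent verification chains into one score.

    Samples `n_chains` verification CoTs and scores the solution by the
    fraction of chains that judge every step correct; the score lies in [0, 1]
    and can rank candidates in best-of-N selection.
    """
    votes = []
    for _ in range(n_chains):
        verdicts = verify(problem, steps, generate)
        votes.append(all(verdicts))
    return sum(votes) / n_chains
```

Sequential scaling (letting a single chain think longer) and parallel scaling (more chains, as here) can be combined under a fixed token budget.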
For "Can judges that reason about reasoning outperform step classifiers?", GenPRM and ThinkPRM provide the strongest evidence and the specific mechanisms. For "Can reward models benefit from reasoning before scoring?", generative PRMs establish the paradigm: the verifier should think before judging, just as the generator should think before answering.
Source: RLVR
Related concepts in this collection
- Can judges that reason about reasoning outperform step classifiers?
  Does framing step-level reward as a reasoning task rather than classification improve how well models evaluate intermediate steps in chains of thought? This matters because current process reward models lack transparency and struggle to generalize.
  (GenPRM/ThinkPRM provide the strongest implementations.)
- Can reward models benefit from reasoning before scoring?
  Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.
  (Generative PRMs operationalize reward-compute scaling.)
- Can self-supervised process rewards replace human annotation?
  Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.
  (GenPRM's RPE and ThinkPRM's synthetic chains reduce annotation dependence.)
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  (Generative PRMs must ensure their CoT actually drives judgment, not just decorates it.)
Original note title: generative process reward models that reason before judging outperform discriminative PRMs with orders of magnitude less data