Process Reward Models That Think
Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized stepwise reward models that verify every step in a solution by generating a verification chain-of-thought (CoT). We propose THINKPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models and, using only 1% of the process labels in PRM800K, outperforms LLM-as-a-Judge and discriminative verifiers across several challenging benchmarks. Specifically, THINKPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, THINKPRM scales up verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.
Generative PRMs and LLM-as-a-Judge. Generative PRMs (Zheng et al., 2023; Zhu et al., 2023) frame verification as a language-generation task, producing step-level correctness decisions as natural language tokens (e.g., “correct” or “incorrect”), typically accompanied by a verification chain-of-thought (CoT). Generative PRMs rely on the standard language modeling objective, training on verification rationales rather than on binary labels. Step-level correctness scores can be derived from generative PRMs by computing conditional token probabilities, e.g., P(“correct”). This approach leverages the strengths of LLMs in text generation, making generative PRMs inherently interpretable and scalable (Zhang et al., 2024a; Mahan et al., 2024; Wang et al., 2023a; Ankner et al., 2024).
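As an illustration of how such scores can be extracted, the sketch below compares the next-token probabilities of “correct” versus “incorrect” at the position where a generative PRM states its step-level decision. The checkpoint name and prompt template are placeholder assumptions, not the setup used in this paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: any causal LM fine-tuned as a generative PRM.
MODEL_NAME = "my-org/generative-prm"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def step_correct_prob(problem: str, steps_so_far: str) -> float:
    """Score the latest solution step as P("correct"), normalized against
    P("incorrect"), at the verifier's decision position."""
    # Illustrative prompt; actual PRMs use their own verification template.
    prompt = (
        f"Problem: {problem}\n"
        f"Solution so far:\n{steps_so_far}\n"
        "Is the last step correct or incorrect? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]

    # The first sub-token of each decision word suffices for a relative score.
    correct_id = tokenizer.encode(" correct", add_special_tokens=False)[0]
    incorrect_id = tokenizer.encode(" incorrect", add_special_tokens=False)[0]
    pair = torch.softmax(next_token_logits[[correct_id, incorrect_id]], dim=-1)
    return pair[0].item()
```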
Recent work on generative PRMs often uses off-the-shelf LLMs prompted to critique solutions, an approach known as LLM-as-a-Judge (Zheng et al., 2024). However, LLM-as-a-Judge can be unreliable, sensitive to prompt phrasing, and prone to invalid outputs such as infinite looping or excessive overthinking (Bavaresco et al., 2024), issues we further confirm in this work. Prior results with reasoning models like QwQ-32B-Preview (Team, 2024) show promise, but their practical utility as verifiers in test-time scaling remains limited without additional training (Zheng et al., 2024).
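To make these failure modes concrete, the sketch below shows one common way to recover a binary verdict from free-form judge output; the prompt wording and the “Verdict:” convention are illustrative assumptions, not the judge setup evaluated here. When the judge loops or never commits to a decision, no verdict can be parsed.

```python
import re
from typing import Optional

# Illustrative judge prompt; in practice small wording changes can flip results.
JUDGE_TEMPLATE = (
    "You are a careful verifier. Check the solution step by step, then finish\n"
    "with 'Verdict: correct' or 'Verdict: incorrect'.\n\n"
    "Problem: {problem}\n\nSolution:\n{solution}\n"
)

def parse_judge_verdict(judge_output: str) -> Optional[bool]:
    """Extract a binary verdict from free-form judge text.
    Returns None if the judge loops or never states a verdict,
    one of the invalid-output cases noted above."""
    match = re.search(r"verdict:\s*(correct|incorrect)", judge_output, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "correct"
```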
Test-Time Scaling and PRMs. Test-time scaling techniques, such as Best-of-N selection (Charniak & Johnson, 2005; Khalifa et al., 2023; Snell et al., 2024) and tree-based search (Wu et al., 2024; Yao et al., 2023; Chen et al., 2024c; Wan et al., 2024), leverage additional inference-time compute to improve reasoning performance. Central to these approaches is the quality of the verifier used to score and select solutions (Beeching et al.; Snell et al., 2024). While both discriminative and generative PRMs can guide these processes, generative PRMs uniquely support simultaneous scaling of both generator and verifier compute (Zhang et al., 2024a; Kalra & Tang, 2025). We show that generative PRMs built on long CoT models (Jaech et al., 2024; Guo et al., 2025; Muennighoff et al., 2025) enable sequential scaling of verification compute by forcing longer verification CoTs. Motivated by the limitations of existing approaches, our work builds on prior efforts that train reward models using synthetic data (Zhu et al., 2023; Wang et al., 2024), aiming to develop generative PRMs with minimal, carefully filtered synthetic step-level supervision. Specifically, we demonstrate that a generative PRM fine-tuned on as few as 8K synthetic verification chains substantially improves over LLM-as-a-Judge PRMs and outperforms discriminative PRMs trained on datasets orders of magnitude larger.
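For concreteness, the following sketch shows Best-of-N selection with a generic verifier; the generate and verify callables are stand-ins for a solution generator and a PRM score aggregated over steps, not interfaces defined in this paper.

```python
from typing import Callable, List

def best_of_n(
    problem: str,
    generate: Callable[[str], str],       # samples one candidate solution
    verify: Callable[[str, str], float],  # scalar verifier score for (problem, solution)
    n: int = 8,
) -> str:
    """Best-of-N selection: sample n candidate solutions and return the one
    the verifier scores highest."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    scores = [verify(problem, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```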