Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

Paper · arXiv 2508.15202 · Published August 21, 2025
Reinforcement Learning

Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality.

Introduction. Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, leading to their increasing application in specialized domains such as finance (Yang, Liu, and Wang 2023; Zhu et al. 2025). However, financial reasoning tasks like financial statement analysis, investment strategy formulation, and regulatory compliance assessment demand a level of precision, factuality, and logical coherence that pushes the limits of current models. Therefore, a critical research direction is to align LLM reasoning pathways with expert cognitive processes using process-monitoring tools such as PRMs (Lightman et al. 2023a; Zhang et al. 2025; Setlur et al. 2024). A PRM can select the best of multiple candidate responses, often as part of test-time scaling strategies like Best-of-N (Khalifa et al. 2025; Liu et al. 2025), and can provide scalar reward signals during reinforcement learning (Zou et al. 2025; Cui et al. 2025).
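To make the Best-of-N use of a PRM concrete, the following is a minimal sketch, not the authors' implementation. It assumes a step-level scorer returning a value in [0, 1] and aggregates step scores with the minimum, which is one common choice among several. The names `trajectory_score`, `best_of_n`, and `toy_step_scorer` are hypothetical, and the toy scorer is a stand-in for a trained PRM.

```python
from typing import Callable, List, Sequence

Trajectory = List[str]  # one reasoning trace = an ordered list of step strings


def trajectory_score(
    steps: Trajectory,
    step_scorer: Callable[[str], float],
    aggregate: Callable = min,
) -> float:
    """Aggregate per-step scores into one trajectory score.

    Assumption: `min` treats a trace as only as good as its weakest step;
    a mean or product could be swapped in via `aggregate`.
    """
    if not steps:
        raise ValueError("a trajectory needs at least one step")
    return aggregate([step_scorer(s) for s in steps])


def best_of_n(
    candidates: Sequence[Trajectory],
    step_scorer: Callable[[str], float],
    aggregate: Callable = min,
) -> Trajectory:
    """Return the candidate trajectory with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        candidates,
        key=lambda t: trajectory_score(t, step_scorer, aggregate),
    )


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a PRM: penalize steps that admit guessing."""
    return 0.2 if "guess" in step.lower() else 0.9


candidates = [
    ["Revenue grew 10%.", "I guess the margin is 50%."],
    ["Revenue grew 10%.", "Margin = profit / revenue = 20 / 100 = 20%."],
]
best = best_of_n(candidates, toy_step_scorer)
```

Here `best` is the second trajectory, because the first contains a step the toy scorer rates low. With a real PRM, only `toy_step_scorer` would change.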

Discussion / Conclusion. Our experimental results across SFT, Best-of-N, and RL applications consistently demonstrate that Fin-PRM outperforms general-purpose baselines. This success validates our central thesis: for high-stakes domains like finance, effective process supervision requires a reward model that is not just logically coherent but deeply specialized and factually grounded. The key to Fin-PRM’s performance is its dual-level, knowledge-aware architecture. By integrating verifiable reward components (r_acc and r_cover) grounded in an expert-derived knowledge base, Fin-PRM moves beyond assessing mere logical plausibility to penalizing factual hallucinations. This confirms that for domains where truth is non-negotiable, a hybrid approach combining LLM-based qualitative assessment with explicit knowledge verification is critical. While this framework provides a robust proof-of-concept, we acknowledge several limitations that open important avenues for future research: the construction of our 3k-sample dataset, while high-quality, was resource-intensive.
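As an illustration of the hybrid idea described above, the sketch below combines an LLM-based quality score with two knowledge-verification terms. It is a simplified assumption and not the paper's formulation: the definitions of `r_cover` and `r_acc`, the weights, and the dictionary-based knowledge base are all hypothetical choices made for this example.

```python
from typing import Dict, Iterable, Tuple


def r_cover(step: str, required_terms: Iterable[str]) -> float:
    """Assumed definition: fraction of required knowledge terms the step mentions."""
    terms = [t.lower() for t in required_terms]
    if not terms:
        return 1.0  # nothing required, so coverage is trivially complete
    text = step.lower()
    return sum(t in text for t in terms) / len(terms)


def r_acc(
    claims: Dict[str, float],
    knowledge_base: Dict[str, float],
    tol: float = 1e-6,
) -> float:
    """Assumed definition: fraction of checkable claims that match the knowledge base."""
    checkable = [k for k in claims if k in knowledge_base]
    if not checkable:
        return 1.0  # no verifiable claim, so no contradiction was found
    correct = sum(abs(claims[k] - knowledge_base[k]) <= tol for k in checkable)
    return correct / len(checkable)


def hybrid_step_reward(
    llm_quality: float,
    acc: float,
    cover: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weights
) -> float:
    """Weighted sum of the qualitative score and the two verification terms."""
    w_q, w_a, w_c = weights
    return w_q * llm_quality + w_a * acc + w_c * cover


# Toy example: the step states a tax rate that contradicts the knowledge base.
knowledge_base = {"tax_rate": 0.25}
step = "Apply the tax rate of 30% to pre-tax income."
claims = {"tax_rate": 0.30}  # assumed to be extracted from the step upstream

acc = r_acc(claims, knowledge_base)
cover = r_cover(step, ["tax rate", "net income"])
reward = hybrid_step_reward(llm_quality=0.8, acc=acc, cover=cover)
```

In this toy case the step reads fluently (quality 0.8), but the contradicted fact drives `acc` to zero and pulls the combined reward down, which is the behavior the hybrid design is meant to capture.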