ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Paper · arXiv 2506.18896 · Published June 23, 2025

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on models' final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory–response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory–response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling.
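As a rough illustration of what "both step-level and trajectory-level supervision" can look like when collapsed into a single scalar reward, the sketch below mixes an average of per-step scores with a whole-trajectory score. The mixing weight and function names are illustrative assumptions, not the paper's exact formulation.

```python
from typing import List


def trajectory_reward(
    step_scores: List[float],   # step-level rewards, one per thinking step
    trajectory_score: float,    # a single trajectory-level reward
    alpha: float = 0.5,         # hypothetical mixing weight between the two signals
) -> float:
    """Combine step-level and trajectory-level supervision into one scalar reward."""
    step_avg = sum(step_scores) / max(len(step_scores), 1)
    return alpha * step_avg + (1.0 - alpha) * trajectory_score
```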

These models have increasingly adopted a trajectory–response output format: a lengthy, comprehensive, and less organized intermediate thinking trajectory, followed by a concise, step-by-step final response conditioned on the prior thinking (as illustrated in Figure 2).
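For concreteness, a minimal sketch of splitting such an output into its thinking trajectory and final response is shown below. The `<think>...</think>` delimiter follows the convention commonly used by DeepSeek-R1-style models; treat both the delimiter and the function as assumptions about the output format rather than the paper's preprocessing pipeline.

```python
from typing import Tuple


def split_trajectory_response(output: str) -> Tuple[str, str]:
    """Split a trajectory-response output into (thinking trajectory, final response)."""
    open_tag, close_tag = "<think>", "</think>"
    if open_tag in output and close_tag in output:
        start = output.index(open_tag) + len(open_tag)
        end = output.index(close_tag)
        trajectory = output[start:end].strip()
        response = output[end + len(close_tag):].strip()
        return trajectory, response
    # No explicit delimiter found: treat the whole output as the final response.
    return "", output.strip()
```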

The increasing utilization of trajectory–response data raises an important question: Can PRMs provide supervision not only to the final responses of large reasoning models, but also to their intermediate thinking trajectories?

We further find that this degradation stems primarily from two key issues: a structural and formatting mismatch between intermediate thinking trajectories and final responses, and the lack of trajectory–response data with assigned rewards during PRM training.

In summary, our main contributions are:

• In-Depth Trajectory-Response Data Analysis in Long-CoT Reasoning. We identify, formulate, and analyze the problem of adapting several existing PRMs to supervise both models’ intermediate reasoning trajectories and their final responses, motivated by the increasing prevalence of trajectory–response distillation data in downstream post-training and test-time scaling.

• Trajectory-aware Reward Modeling for Data Selection, RL and Test-Time Scaling. We introduce ReasonFlux-PRM, a trajectory-aware process reward model that incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment for model thinking trajectories. ReasonFlux-PRM can be integrated into both offline and online workflows for more generalized purposes, including offline selection of high-quality training data, online policy optimization in RL training, and test-time scaling.
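To make the offline data-selection use concrete, the sketch below ranks candidate trajectory–response pairs by a PRM score and keeps the top-k for supervised fine-tuning. Here `prm_score` is a hypothetical stand-in for ReasonFlux-PRM's scoring interface, and the selection rule (top-k by score) is an assumed, simplified policy.

```python
from typing import Callable, List, Tuple


def select_distillation_data(
    candidates: List[Tuple[str, str]],        # (thinking trajectory, final response) pairs
    prm_score: Callable[[str, str], float],   # assumed scoring interface: (s, a) -> scalar
    k: int = 1000,
) -> List[Tuple[str, str]]:
    """Keep the k highest-scoring trajectory-response pairs for downstream SFT."""
    scored = [(prm_score(s, a), (s, a)) for s, a in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [pair for _, pair in scored[:k]]
```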

We represent each trajectory–response data point as a tuple (s, a), where s = (s1, s2, . . . , sT) denotes a thinking trajectory consisting of T intermediate steps, and a = (a1, a2, . . . , aT) denotes the final response, which can also be structured as a chain-of-thought trace with T formatted and organized steps.
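One straightforward way to represent this (s, a) tuple in code is shown below; the class and field names are illustrative, not taken from the paper's codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryResponse:
    thinking_steps: List[str]   # s = (s1, ..., sT): intermediate thinking steps
    response_steps: List[str]   # a = (a1, ..., aT): formatted final-response steps
```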

(i) Thinking trajectories often include branching, where the model revisits earlier steps, explores alternative paths, and revises prior assumptions—behavior rarely observed in the linear and polished structure of final responses. (ii) Thinking trajectories tend to exhibit weaker global coherence across steps, as each step is often locally focused and not optimized for narrative continuity.