Reinforcement Learning for LLMs

Why do standard process reward models fail on thinking traces?

Existing PRMs assume clean, sequential steps, but reasoning models produce messy trajectories with branching and backtracking. Understanding this mismatch could improve how we supervise and evaluate exploratory reasoning.

Note · 2026-04-18 · sourced from Reasoning Methods CoT ToT
How should we allocate compute budget at inference time? What makes chain-of-thought reasoning actually work?

ReasonFlux-PRM identifies a structural mismatch that existing process reward models ignore: the thinking trajectories produced by reasoning models (o1-style, R1-style) have fundamentally different characteristics from the polished final responses those models output. Thinking traces branch into exploration, revisit previous steps, backtrack from dead ends, and exhibit weaker global coherence. Standard PRMs trained on clean step-by-step solutions degrade when applied to this messy trajectory format.

The solution is trajectory-aware supervision — a PRM architecture that evaluates both the intermediate thinking trajectory and the final response, understanding that the trajectory's value lies in its exploratory structure, not in step-level correctness. This is a meaningful departure from the assumptions underlying both outcome-based reward models (which ignore the trajectory entirely) and standard process reward models (which assume clean, sequential steps).
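A minimal sketch of what dual evaluation could look like. The split into separate trajectory and response scoring functions, and the mixing weight `alpha`, are assumptions for illustration, not ReasonFlux-PRM's actual formulation:

```python
# Illustrative sketch of trajectory-aware reward scoring. The two
# scoring heads and the linear combination are assumptions, not the
# paper's formulation.
from typing import Callable

def trajectory_aware_reward(
    score_trajectory: Callable[[str], float],  # judges the raw thinking trace
    score_response: Callable[[str], float],    # judges the polished final answer
    trajectory: str,
    response: str,
    alpha: float = 0.5,  # hypothetical trade-off between the two signals
) -> float:
    """Combine a trajectory-level score with a response-level score.

    Unlike a standard PRM, the trajectory is judged as a whole for its
    exploratory structure, so branching and backtracking are not scored
    as if they were erroneous solution steps.
    """
    t = score_trajectory(trajectory)
    r = score_response(response)
    return alpha * t + (1.0 - alpha) * r
```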

Three deployment modes demonstrate the architecture's versatility: offline data selection (filtering training examples by trajectory quality), online RL policy optimization (providing dense rewards during training), and test-time scaling (guiding search at inference). The data selection use case is particularly relevant in light of Why do correct code trajectories teach models to tolerate errors? — trajectory-aware PRMs could provide the filtering signal that distinguishes genuinely good trajectories from lucky ones.
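A hedged sketch of the offline mode under assumed names: `prm_score`, the example schema, and the 0.7 cutoff are hypothetical stand-ins, not details from the paper:

```python
# Hypothetical offline data-selection pass: keep training pairs whose
# trajectory-aware reward clears a threshold.
def select_training_data(examples, prm_score, threshold=0.7):
    """Filter (trajectory, response) training pairs by trajectory quality.

    Outcome-only filtering keeps any pair whose final answer is correct;
    scoring the trajectory as well is what separates genuinely good
    reasoning from lucky arrivals at the right answer.
    """
    kept = []
    for ex in examples:
        score = prm_score(ex["trajectory"], ex["response"])
        if score >= threshold:
            kept.append({**ex, "prm_score": score})
    return kept
```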

The key connection is to Can judges that reason about reasoning outperform step classifiers? StepWiser's self-segmentation into "chunks of thought" partially addresses the trajectory structure problem by identifying logically complete units rather than arbitrary step boundaries. ReasonFlux-PRM goes further by explicitly modeling the branching and revisiting patterns rather than segmenting them away.

This also extends Which sentences actually steer a reasoning trace? — if backtracking sentences have disproportionate causal influence, a trajectory-aware PRM should learn to recognize and appropriately weight these anchor points rather than penalizing them as errors (which a standard PRM would do).

As Does failed-step fraction predict reasoning quality better? suggests, the trajectory-aware approach properly handles the fact that failed steps in a thinking trace are informative — they represent explored-and-rejected paths, not errors to penalize.
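Putting the last two points together: a toy sketch of an aggregation rule that up-weights backtracking instead of penalizing it. The marker list and fixed weights are hand-coded assumptions standing in for behavior a trained trajectory-aware PRM would learn:

```python
# Toy aggregation that treats backtracking as signal rather than error.
# Markers and weights are illustrative assumptions, not learned values.
BACKTRACK_MARKERS = ("wait", "hmm", "actually", "let me reconsider")

def aggregate_step_scores(steps, step_score):
    """Weighted mean over step scores, instead of the min a standard PRM
    might take (where one explored-and-rejected branch tanks the whole
    trace); candidate anchor sentences are up-weighted."""
    total = weight_sum = 0.0
    for step in steps:
        w = 2.0 if step.strip().lower().startswith(BACKTRACK_MARKERS) else 1.0
        total += w * step_score(step)
        weight_sum += w
    return total / max(weight_sum, 1e-8)
```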


trajectory-aware process reward models must handle branching and revisiting in thinking traces — standard PRMs degrade on trajectory-response format