Why do outcome-based reward models fail at intermediate step evaluation?
Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
The Reasoning Language Models Blueprint provides a precise taxonomy of the two primary reward model families and their failure modes:
Outcome-Based Reward Models (ORMs):
- Evaluate reasoning solely based on final outcome: P(correct(z_{T+1}) | z_0, ..., z_{T+1})
- Training objective is misaligned with intermediate step evaluation — they are trained on final outcomes only
- Systematically pessimistic for intermediate steps: a correct intermediate step can look "wrong" if a subsequent error occurs (see the sketch after this list)
- High false-negative rate: ORMs underestimate solvability of problems from intermediate states
- Cannot distinguish between "the chain got lucky" and "the chain reasoned correctly"
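A minimal sketch (not from the Blueprint paper) of why this pessimism arises: if the trajectory-level outcome label is used as the training target for every prefix, a correct early step in a failed chain inherits the blame for a later error. The step strings and correctness flags below are purely illustrative.

```python
# Sketch: outcome-only supervision assigns the final label to every prefix,
# so a correct early step in a failed trajectory is (wrongly) treated as bad.
trajectory = [
    {"step": "expand (3+4)^2 = 3^2 + 2*3*4 + 4^2", "actually_correct": True},
    {"step": "rewrite as 9 + 24 + 16",             "actually_correct": True},
    {"step": "compute 9 + 24 + 16 = 48",           "actually_correct": False},  # arithmetic slip
]
final_outcome_correct = False  # the only signal an ORM is trained on

# ORM-style training targets: every prefix inherits the trajectory-level label.
orm_targets = [int(final_outcome_correct) for _ in trajectory]
# Oracle per-step labels, shown for contrast.
oracle_targets = [int(s["actually_correct"]) for s in trajectory]

print("ORM targets:   ", orm_targets)     # [0, 0, 0] -> steps 1-2 become false negatives
print("Oracle targets:", oracle_targets)  # [1, 1, 0]
```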
Process-Based Reward Models (PRMs):
- Evaluate reasoning step-by-step: P(correct(z_t) | z_0, ..., z_t)
- Dense rewards enable error localization — can pinpoint which step went wrong (sketched after this list)
- Better alignment with MCTS, which requires per-action evaluation rather than per-trajectory evaluation
- Trade-off: they require extensive step-level annotations, either from skilled human annotators (expensive) or from LLM-generated labels (lower quality, given models' limited self-evaluation capability)
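A hedged sketch of how dense per-step scores support error localization. `prm_score` stands in for a trained process reward model; the toy scorer and the 0.5 threshold are assumptions for illustration, not any published system.

```python
from typing import Callable, List, Optional

def first_faulty_step(steps: List[str],
                      prm_score: Callable[[List[str]], float],
                      threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step whose process reward, roughly
    P(correct(z_t) | z_0..z_t), falls below `threshold`; None if all pass."""
    for t in range(len(steps)):
        if prm_score(steps[: t + 1]) < threshold:
            return t
    return None

# Toy stand-in for a trained PRM: penalizes the miscomputed total "48".
toy_prm = lambda prefix: 0.1 if "48" in prefix[-1] else 0.9

steps = ["expand (3+4)^2", "rewrite as 9 + 24 + 16", "compute 9 + 24 + 16 = 48"]
print(first_faulty_step(steps, toy_prm))  # -> 2: the arithmetic slip is localized
```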
Q-Value models (Q-VMs) vs V-Value models (V-VMs): A further split. Q-VMs evaluate Q(s, a) — expected cumulative reward for taking action a in state s — and are preferred for MCTS because they evaluate edges (actions), not just nodes (states). V-VMs evaluate V(s) — expected cumulative reward from state s — and provide a broader state-level view but less guidance for action selection.
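A small illustration (my sketch, not the Blueprint's formulation) of why Q-VMs fit per-action selection in MCTS: a Q-VM scores candidate actions directly as edges, whereas a V-VM scores states only, so using it to pick an action requires materializing each successor state first. Both value functions are hypothetical stand-ins for trained models.

```python
# Partial reasoning chain (the node) and candidate next steps (the edges).
state = ["expand (3+4)^2", "rewrite as 9 + 24 + 16"]
candidate_actions = ["conclude the total is 49", "conclude the total is 48"]

def q_value(state, action):   # Q-VM: scores the (state, action) edge directly
    return 0.9 if "49" in action else 0.2

def v_value(state):           # V-VM: scores a state (node) only
    return 0.9 if any("49" in step for step in state) else 0.2

best_by_q = max(candidate_actions, key=lambda a: q_value(state, a))
# With a V-VM we must first build each successor state, then score it.
best_by_v = max(candidate_actions, key=lambda a: v_value(state + [a]))

print(best_by_q)  # 'conclude the total is 49'
print(best_by_v)  # same pick, but only after expanding every candidate
```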
Generative Reward Models (GRMs) as a third category: The RRM and DeepSeek-GRM papers introduce a third family alongside ORMs and PRMs. GRMs harness LLMs to produce interpretable, natural-language feedback rather than scalar scores. They can follow adaptive evaluation instructions, construct synthetic training data, and self-improve through iterative refinement. GRMs unify scoring of single, paired, and multiple responses within a pure language representation. However, concerns persist about evaluation reliability — LLMs may produce biased or hallucinated judgments that diverge from human standards. As Can reward models benefit from reasoning before scoring? argues, GRMs become most powerful when combined with extended reasoning before judgment.
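A hedged sketch of the GRM pattern (critique first, scalar afterwards). `generate` is a placeholder for whatever LLM completion call you use, and the prompt wording and SCORE convention are my assumptions, not the protocols of RRM or DeepSeek-GRM.

```python
import re

def generative_reward(question: str, response: str, generate) -> tuple[str, float]:
    """Ask an LLM judge to reason in natural language before scoring.
    `generate(prompt) -> str` is a hypothetical completion function."""
    prompt = (
        "Evaluate the response to the question below. First write a short "
        "step-by-step critique, then end with a line 'SCORE: x' where x is "
        "between 0 and 1.\n\n"
        f"Question: {question}\nResponse: {response}"
    )
    critique = generate(prompt)
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", critique)
    score = float(match.group(1)) if match else 0.0  # fallback if the format is ignored
    return critique, score  # interpretable feedback plus a scalar for ranking or RL

# Toy usage with a stubbed LLM call:
stub = lambda prompt: "The final addition is wrong.\nSCORE: 0.2"
print(generative_reward("What is (3+4)^2?", "9 + 24 + 16 = 48", stub)[1])  # 0.2
```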
This taxonomy explains why Can self-supervised process rewards replace human annotation? matters: the annotation cost is the primary bottleneck for PRMs, and self-supervised approaches address precisely this.
The ORM/PRM split is also the reason Can curriculum learning approximate expensive process supervision? is significant — R3 uses outcome supervision only, yet achieves process-supervision-like step feedback through a reverse curriculum that slides backward from task completion (sketched below).
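A minimal sketch of that reverse-curriculum idea, assuming a demonstration whose prefixes can serve as episode starting points; `try_to_finish` is a hypothetical single rollout of the policy that reports only final-answer success. Sliding the start backward turns one outcome signal into stage-by-stage feedback.

```python
def reverse_curriculum(demo_steps, try_to_finish, success_target=0.8, trials=20):
    """Slide the episode start from near task completion toward the beginning,
    using only outcome rewards. The stage where success collapses localizes
    the step the policy cannot yet reason through."""
    for start in range(len(demo_steps) - 1, -1, -1):  # later prefixes first
        successes = sum(try_to_finish(demo_steps[:start]) for _ in range(trials))
        if successes / trials < success_target:
            return start  # first stage the policy fails; train here before sliding back
        # (in the full method, the policy is trained at each stage before moving on)
    return 0  # the whole problem is solvable from scratch with outcome reward alone
```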
Source: Reasoning Architectures
Related concepts in this collection
- Can self-supervised process rewards replace human annotation?
  Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.
  Addresses the annotation cost problem this note identifies.
- Can curriculum learning approximate expensive process supervision?
  Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome-supervision cost?
  An architectural workaround for the ORM/PRM trade-off.
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  RAG-Gym extends the PRM advantage to agentic retrieval systems.
- Does failed-step fraction predict reasoning quality better?
  Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
  A PRM-detectable signal: failed steps as a quality predictor.
- Can agents learn to reason better without just chasing rewards?
  Explores whether reinforcement learning can train agents to exhibit genuine metacognitive reasoning—planning, reflection, exploration, monitoring—rather than simply optimizing for task success through any means necessary.
  Agentic process supervision: RLVMR's programmatic meta-reasoning rewards (planning/exploration/reflection/monitoring) are a domain-specific PRM variant for agentic tasks, providing dense intermediate feedback without human annotation.
- Can judges that reason about reasoning outperform step classifiers?
  Does framing step-level reward as a reasoning task rather than classification improve how well models evaluate intermediate steps in chains of thought? This matters because current process reward models lack transparency and struggle to generalize.
  Resolves the ORM/PRM trade-off differently: StepWiser makes process rewards self-supervised (no annotation cost) AND generative (interpretable reasoning about each step); its self-segmentation into chunks-of-thought also addresses the step boundary problem that limits standard PRMs.
- Can generative reasoning improve process reward model efficiency?
  Do process reward models that generate reasoning before judging outperform traditional discriminative approaches? This explores whether letting verifiers think—not just score—changes what test-time scaling can achieve.
  GenPRM/ThinkPRM collapse the ORM/PRM trade-off: generative PRMs achieve PRM-quality dense step evaluation with ORM-level annotation costs (1% of the PRM800K data), because reasoning-before-judging extracts more signal per training example.
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  A third option in the ORM/PRM taxonomy: L2T provides dense information-theoretic process rewards via PAC-Bayes bounds and Fisher information, annotation-free like ORMs but dense like PRMs; it also quantifies the cost of outcome-only training — more than double the needed tokens.
Original note title: outcome-based reward models are systematically pessimistic for intermediate reasoning steps while process-based models provide dense rewards at high annotation cost