Why do outcome-based reward models fail at intermediate step evaluation?
Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
The Reasoning Language Models Blueprint provides a precise taxonomy of the two primary reward model families and their failure modes:
Outcome-Based Reward Models (ORMs):
- Evaluate reasoning solely based on final outcome: P(correct(z_{T+1}) | z_0, ..., z_{T+1})
- Training objective is misaligned with intermediate step evaluation — they are trained on final outcomes only
- Systematically pessimistic for intermediate steps: a correct intermediate step can look "wrong" if a subsequent error occurs (see the sketch after this list)
- High false-negative rate: ORMs underestimate solvability of problems from intermediate states
- Cannot distinguish between "the chain got lucky" and "the chain reasoned correctly"
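A minimal sketch (not from the Blueprint paper) of why this pessimism arises: if the trajectory-level outcome label is used as the training target for every prefix, a correct early step in a failed chain inherits the blame for a later error. The step strings and correctness flags below are purely illustrative.

```python
# Sketch: outcome-only supervision assigns the final label to every prefix,
# so a correct early step in a failed trajectory is (wrongly) treated as bad.
trajectory = [
    {"step": "expand (3+4)^2 = 3^2 + 2*3*4 + 4^2", "actually_correct": True},
    {"step": "rewrite as 9 + 24 + 16",             "actually_correct": True},
    {"step": "compute 9 + 24 + 16 = 48",           "actually_correct": False},  # arithmetic slip
]
final_outcome_correct = False  # the only signal an ORM is trained on

# ORM-style training targets: every prefix inherits the trajectory-level label.
orm_targets = [int(final_outcome_correct) for _ in trajectory]
# Oracle per-step labels, shown for contrast.
oracle_targets = [int(s["actually_correct"]) for s in trajectory]

print("ORM targets:   ", orm_targets)     # [0, 0, 0] -> steps 1-2 become false negatives
print("Oracle targets:", oracle_targets)  # [1, 1, 0]
```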
Process-Based Reward Models (PRMs):
- Evaluate reasoning step-by-step: P(correct(z_t) | z_0, ..., z_t)
- Dense rewards enable error localization — can pinpoint which step went wrong (sketched after this list)
- Better alignment with MCTS, which requires per-action evaluation rather than per-trajectory evaluation
- Trade-off: they require extensive step-level annotations, either from skilled human annotators (expensive) or from LLM-generated labels (lower quality, given models' limited self-evaluation capability)
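A hedged sketch of how dense per-step scores support error localization. `prm_score` stands in for a trained process reward model; the toy scorer and the 0.5 threshold are assumptions for illustration, not any published system.

```python
from typing import Callable, List, Optional

def first_faulty_step(steps: List[str],
                      prm_score: Callable[[List[str]], float],
                      threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step whose process reward, roughly
    P(correct(z_t) | z_0..z_t), falls below `threshold`; None if all pass."""
    for t in range(len(steps)):
        if prm_score(steps[: t + 1]) < threshold:
            return t
    return None

# Toy stand-in for a trained PRM: penalizes the miscomputed total "48".
toy_prm = lambda prefix: 0.1 if "48" in prefix[-1] else 0.9

steps = ["expand (3+4)^2", "rewrite as 9 + 24 + 16", "compute 9 + 24 + 16 = 48"]
print(first_faulty_step(steps, toy_prm))  # -> 2: the arithmetic slip is localized
```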
Q-Value models (Q-VMs) vs V-Value models (V-VMs): A further split. Q-VMs evaluate Q(s, a) — expected cumulative reward for taking action a in state s — and are preferred for MCTS because they evaluate edges (actions), not just nodes (states). V-VMs evaluate V(s) — expected cumulative reward from state s — and provide a broader state-level view but less guidance for action selection.
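A small illustration (my sketch, not the Blueprint's formulation) of why Q-VMs fit per-action selection in MCTS: a Q-VM scores candidate actions directly as edges, whereas a V-VM scores states only, so using it to pick an action requires materializing each successor state first. Both value functions are hypothetical stand-ins for trained models.

```python
# Partial reasoning chain (the node) and candidate next steps (the edges).
state = ["expand (3+4)^2", "rewrite as 9 + 24 + 16"]
candidate_actions = ["conclude the total is 49", "conclude the total is 48"]

def q_value(state, action):   # Q-VM: scores the (state, action) edge directly
    return 0.9 if "49" in action else 0.2

def v_value(state):           # V-VM: scores a state (node) only
    return 0.9 if any("49" in step for step in state) else 0.2

best_by_q = max(candidate_actions, key=lambda a: q_value(state, a))
# With a V-VM we must first build each successor state, then score it.
best_by_v = max(candidate_actions, key=lambda a: v_value(state + [a]))

print(best_by_q)  # 'conclude the total is 49'
print(best_by_v)  # same pick, but only after expanding every candidate
```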
Generative Reward Models (GRMs) as a third category: The RRM and DeepSeek-GRM papers introduce a third family alongside ORMs and PRMs. GRMs harness LLMs to produce interpretable, natural-language feedback rather than scalar scores. They can follow adaptive evaluation instructions, construct synthetic training data, and self-improve through iterative refinement. GRMs unify scoring of single, paired, and multiple responses within a pure language representation. However, concerns persist about evaluation reliability — LLMs may produce biased or hallucinated judgments that diverge from human standards. As Can reward models benefit from reasoning before scoring? argues, GRMs become most powerful when combined with extended reasoning before judgment.
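A hedged sketch of the GRM pattern (critique first, scalar afterwards). `generate` is a placeholder for whatever LLM completion call you use, and the prompt wording and SCORE convention are my assumptions, not the protocols of RRM or DeepSeek-GRM.

```python
import re

def generative_reward(question: str, response: str, generate) -> tuple[str, float]:
    """Ask an LLM judge to reason in natural language before scoring.
    `generate(prompt) -> str` is a hypothetical completion function."""
    prompt = (
        "Evaluate the response to the question below. First write a short "
        "step-by-step critique, then end with a line 'SCORE: x' where x is "
        "between 0 and 1.\n\n"
        f"Question: {question}\nResponse: {response}"
    )
    critique = generate(prompt)
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", critique)
    score = float(match.group(1)) if match else 0.0  # fallback if the format is ignored
    return critique, score  # interpretable feedback plus a scalar for ranking or RL

# Toy usage with a stubbed LLM call:
stub = lambda prompt: "The final addition is wrong.\nSCORE: 0.2"
print(generative_reward("What is (3+4)^2?", "9 + 24 + 16 = 48", stub)[1])  # 0.2
```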
This taxonomy explains why Can self-supervised process rewards replace human annotation? matters: the annotation cost is the primary bottleneck for PRMs, and self-supervised approaches address precisely this.
The ORM/PRM split is also the reason Can curriculum learning approximate expensive process supervision? is significant — R3 uses outcome supervision only, yet achieves process-supervision-like step feedback through a reverse curriculum that slides backward from task completion (sketched below).
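A minimal sketch of that reverse-curriculum idea, assuming a demonstration whose prefixes can serve as episode starting points; `try_to_finish` is a hypothetical single rollout of the policy that reports only final-answer success. Sliding the start backward turns one outcome signal into stage-by-stage feedback.

```python
def reverse_curriculum(demo_steps, try_to_finish, success_target=0.8, trials=20):
    """Slide the episode start from near task completion toward the beginning,
    using only outcome rewards. The stage where success collapses localizes
    the step the policy cannot yet reason through."""
    for start in range(len(demo_steps) - 1, -1, -1):  # later prefixes first
        successes = sum(try_to_finish(demo_steps[:start]) for _ in range(trials))
        if successes / trials < success_target:
            return start  # first stage the policy fails; train here before sliding back
        # (in the full method, the policy is trained at each stage before moving on)
    return 0  # the whole problem is solvable from scratch with outcome reward alone
```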
Source: Reasoning Architectures
Related concepts in this collection
- Can self-supervised process rewards replace human annotation?
  Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.
  Addresses the annotation cost problem this note identifies.
- Can curriculum learning approximate expensive process supervision?
  Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome-supervision cost?
  An architectural workaround for the ORM/PRM trade-off.
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  RAG-Gym extends the PRM advantage to agentic retrieval systems.
- Does failed-step fraction predict reasoning quality better?
  Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
  A PRM-detectable signal: failed steps as a quality predictor.
- Can agents learn to reason better without just chasing rewards?
  Explores whether reinforcement learning can train agents to exhibit genuine metacognitive reasoning—planning, reflection, exploration, monitoring—rather than simply optimizing for task success through any means necessary.
  Agentic process supervision: RLVMR's programmatic meta-reasoning rewards (planning/exploration/reflection/monitoring) are a domain-specific PRM variant for agentic tasks, providing dense intermediate feedback without human annotation.
- Can judges that reason about reasoning outperform step classifiers?
  Does framing step-level reward as a reasoning task rather than classification improve how well models evaluate intermediate steps in chains of thought? This matters because current process reward models lack transparency and struggle to generalize.
  Resolves the ORM/PRM trade-off differently: StepWiser makes process rewards self-supervised (no annotation cost) AND generative (interpretable reasoning about each step); its self-segmentation into chunks-of-thought also addresses the step boundary problem that limits standard PRMs.
- Can generative reasoning improve process reward model efficiency?
  Do process reward models that generate reasoning before judging outperform traditional discriminative approaches? This explores whether letting verifiers think—not just score—changes what test-time scaling can achieve.
  GenPRM/ThinkPRM collapse the ORM/PRM trade-off: generative PRMs achieve PRM-quality dense step evaluation with ORM-level annotation costs (1% of the PRM800K data), because reasoning-before-judging extracts more signal per training example.
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  A third option in the ORM/PRM taxonomy: L2T provides dense information-theoretic process rewards via PAC-Bayes bounds and Fisher information, annotation-free like ORMs but dense like PRMs; it also quantifies the cost of outcome-only training — more than double the needed tokens.
Original note title: outcome-based reward models are systematically pessimistic for intermediate reasoning steps while process-based models provide dense rewards at high annotation cost