Reinforcement Learning for LLMs

Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.

Note · 2026-02-22 · sourced from Reasoning Architectures

The Reasoning Language Models Blueprint provides a precise taxonomy of the two primary reward model families and their failure modes:

Outcome-Based Reward Models (ORMs): score only the final result of a reasoning trace. The labels are cheap to collect, but the signal is sparse and systematically pessimistic about intermediate steps: a trace with sound early reasoning and a late slip earns no more credit than pure noise.

Process-Based Reward Models (PRMs): score every intermediate step, providing the dense reward signal that credit assignment over long reasoning chains needs. Their failure mode is cost: each step must be annotated, which makes high-quality process supervision expensive to scale. (A sketch contrasting the two reward shapes follows this list.)
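A minimal sketch of the contrast, with hypothetical function names not taken from the source: the ORM assigns one sparse reward at the end of the trace, while the PRM assigns dense per-step rewards that presuppose per-step labels.

```python
from typing import List

def orm_reward(steps: List[str], final_answer_correct: bool) -> List[float]:
    """ORM: one sparse reward for the final outcome; intermediate steps get 0.
    A trace with sound early steps and a late slip scores the same as pure
    noise, which is why ORMs are pessimistic about intermediate steps."""
    rewards = [0.0] * len(steps)
    rewards[-1] = 1.0 if final_answer_correct else 0.0
    return rewards

def prm_reward(steps: List[str], step_labels: List[float]) -> List[float]:
    """PRM: a dense per-step reward, but every label must come from costly
    human or model annotation of each intermediate step."""
    assert len(step_labels) == len(steps)
    return step_labels

trace = ["parse problem", "set up equation", "sign error", "wrong answer"]
print(orm_reward(trace, final_answer_correct=False))  # [0.0, 0.0, 0.0, 0.0]
print(prm_reward(trace, [1.0, 1.0, 0.0, 0.0]))        # [1.0, 1.0, 0.0, 0.0]
```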

Q-Value models (Q-VMs) vs V-Value models (V-VMs): A further split. Q-VMs evaluate Q(s, a) — expected cumulative reward for taking action a in state s — and are preferred for MCTS because they evaluate edges (actions), not just nodes (states). V-VMs evaluate V(s) — expected cumulative reward from state s — and provide a broader state-level view but less guidance for action selection.
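To make the edge-vs-node distinction concrete, here is a minimal sketch (all names hypothetical, not from the source) of action selection under each model: a Q-VM scores candidate edges (s, a) directly, while a V-VM scores only states, so every child must be expanded before it can be ranked.

```python
from typing import Callable, List

State, Action = str, str

def select_with_qvm(s: State, actions: List[Action],
                    q: Callable[[State, Action], float]) -> Action:
    # Q-VM scores each candidate edge (s, a) directly; no expansion needed.
    return max(actions, key=lambda a: q(s, a))

def select_with_vvm(s: State, actions: List[Action],
                    step: Callable[[State, Action], State],
                    v: Callable[[State], float]) -> Action:
    # V-VM scores states only, so we must generate each successor s' = step(s, a)
    # and rank edges by V(s'): one extra expansion per candidate action.
    return max(actions, key=lambda a: v(step(s, a)))

# Toy usage with a table-backed Q-VM:
q_table = {("s0", "expand-left"): 0.7, ("s0", "expand-right"): 0.4}
best = select_with_qvm("s0", ["expand-left", "expand-right"],
                       lambda s, a: q_table[(s, a)])
print(best)  # expand-left
```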

Generative Reward Models (GRMs) as a third category: The RRM and DeepSeek-GRM papers introduce a third family alongside ORMs and PRMs. GRMs use LLMs to produce interpretable, natural-language feedback rather than scalar scores. They can follow adaptive evaluation instructions, construct synthetic training data, and self-improve through iterative refinement, and they unify scoring of single, paired, and multiple responses within a pure language representation. However, concerns persist about evaluation reliability: LLMs may produce biased or hallucinated judgments that diverge from human standards. As "Can reward models benefit from reasoning before scoring?" explores, GRMs become most powerful when combined with extended reasoning before judgment.
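A minimal sketch of the GRM pattern under these assumptions (the `generate` stub and `grm_score` helper are hypothetical, not from the RRM or DeepSeek-GRM papers): the model writes a natural-language critique first, then ends with a parseable verdict, so the judgment is interpretable yet still yields a scalar for training.

```python
import re

def generate(prompt: str) -> str:
    """Stand-in for a real LLM-judge call; returns canned output for the demo."""
    return ("Step 1 sets up the equation correctly; step 2 drops a sign, "
            "so the final answer is wrong. Score: 3/10")

def grm_score(question: str, response: str, rubric: str) -> tuple[str, float]:
    # Ask for a critique first, then a verdict in a fixed, parseable format.
    prompt = (
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Write a step-by-step critique, then end with 'Score: X/10'."
    )
    critique = generate(prompt)
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)/10", critique)
    score = float(match.group(1)) / 10.0 if match else 0.0
    return critique, score

critique, score = grm_score("Solve 2x + 3 = 7", "x = -2", "Check each step.")
print(score)  # 0.3, alongside the full natural-language critique
```

Fixing the verdict format ("Score: X/10") is one common way to keep the free-form critique while still extracting a reliable scalar; the rubric string is what lets the same judge follow adaptive evaluation instructions.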

This taxonomy explains why "Can self-supervised process rewards replace human annotation?" matters: the annotation cost is the primary bottleneck for PRMs, and self-supervised approaches address precisely this.

The ORM/PRM split is also why "Can curriculum learning approximate expensive process supervision?" is significant: R3 uses outcome supervision only but achieves process-supervision-like step feedback by decomposing the problem curriculum.


Source: Reasoning Architectures


Outcome-based reward models are systematically pessimistic for intermediate reasoning steps, while process-based models provide dense rewards at high annotation cost.