INQUIRING LINE

Do outcome-only reward signals miss step-level errors that compound later?

This explores whether reward signals based only on final outcomes (right/wrong answer) overlook mistakes in the intermediate reasoning steps — and what the corpus offers as fixes.


The corpus has a surprisingly direct answer: yes, and the failure has a measurable signature. Outcome-based reward models (ORMs) are trained only on whether the final answer was correct, and as a result they are *systematically pessimistic* about intermediate steps: a correct step on the way to an eventually wrong answer gets penalized, producing high false-negative rates (Why do outcome-based reward models fail at intermediate step evaluation?). So the issue isn't just that outcome signals are silent about steps; they actively misjudge good steps that happen to sit inside losing trajectories. The classic fix is a process reward model (PRM) that scores each step, but that demands expensive expert annotation, which is a large part of why outcome-only training stays popular despite its blind spot.
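To make the false-negative signature concrete, here is a minimal sketch of what happens when every step simply inherits its trajectory's final outcome as a label. It is a toy illustration of the label noise, not a trained ORM: the `Trajectory` type, the ground-truth `step_correct` labels, and the three example trajectories are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    step_correct: List[bool]  # hypothetical ground-truth label for each step
    final_correct: bool       # the only signal an outcome-only setup sees


def outcome_labels(traj: Trajectory) -> List[bool]:
    """Outcome-only labeling: every step inherits the final result."""
    return [traj.final_correct] * len(traj.step_correct)


def false_negative_rate(trajs: List[Trajectory]) -> float:
    """Fraction of truly correct steps that outcome labeling marks as bad."""
    correct_steps = 0
    mislabeled = 0
    for traj in trajs:
        for truth, label in zip(traj.step_correct, outcome_labels(traj)):
            if truth:
                correct_steps += 1
                if not label:
                    mislabeled += 1
    return mislabeled / correct_steps if correct_steps else 0.0


# Made-up data: one fully correct run, two runs that start well and then fail.
trajs = [
    Trajectory([True, True, True], final_correct=True),
    Trajectory([True, True, False], final_correct=False),
    Trajectory([True, False, False], final_correct=False),
]
fnr = false_negative_rate(trajs)
print(f"false-negative rate on correct steps: {fnr:.2f}")  # 3 of 6 correct steps mislabeled
```

In this toy data, half of the genuinely correct steps receive a negative label purely because they sit inside trajectories that later went wrong.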

Why do these step-level errors compound? One note traces the mechanism: RL training moves through two phases, where early learning is driven by *execution* correctness and the later bottleneck becomes *strategic planning*. As planning becomes the deciding factor, the errors that matter most are no longer local slips but bad high-level moves, exactly the kind a single end-of-trajectory scalar can't localize (Does RL training follow a predictable two-phase learning sequence?). A related diagnosis is that numerical rewards plateau because they carry no information about *why* a step failed or *how* to fix it; models stuck on such a plateau can produce correct solutions once they are given chain-of-thought critiques rather than a bare number (Can natural language feedback overcome numerical reward plateaus?). A more general statement of the same point: feedback decomposes into 'evaluative' (how good was it) and 'directive' (how should it change) signals, and a scalar reward keeps the first while discarding the second (Can scalar rewards capture all the information in agent feedback?).

The more interesting part of the collection is how researchers recover step-level signal *without* paying for hand-annotated process labels. Tree-GRPO branches the rollout and compares sibling subtrees, converting a single trajectory-level outcome reward into step-level preferences automatically, so structure alone manufactures process supervision (Can tree structure alone convert outcome rewards into process supervision?). A different route uses the agent's own shifting beliefs: the log-ratio of its successive probability estimates becomes a dense per-turn reward with no critic network at all (Can an agent's own beliefs guide credit assignment without critics?). And when you do want an explicit step judge, training it to *reason about* the reasoning beats training a classifier to label steps: generative stepwise judges outperform discriminative ones with far less training data (Can judges that reason about reasoning outperform classifier rewards?).
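The two annotation-free routes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Tree-GRPO or ΔBelief-RL implementation: `Node`, `value`, `sibling_preferences`, and `belief_delta_rewards` are hypothetical names, scoring a subtree by the mean outcome of its leaves is an assumed choice, and the belief values are made up.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    outcome: float = 0.0  # read only at leaves: 1.0 = correct final answer


def value(node: Node) -> float:
    """Assumed subtree score: mean outcome reward over the leaves below."""
    if not node.children:
        return node.outcome
    return sum(value(c) for c in node.children) / len(node.children)


def sibling_preferences(node: Node) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) pairs between siblings at every branch."""
    prefs: List[Tuple[str, str]] = []
    kids = node.children
    for i in range(len(kids)):
        for j in range(i + 1, len(kids)):
            vi, vj = value(kids[i]), value(kids[j])
            if vi > vj:
                prefs.append((kids[i].name, kids[j].name))
            elif vj > vi:
                prefs.append((kids[j].name, kids[i].name))
            # Ties carry no preference signal and are skipped.
    for kid in kids:
        prefs.extend(sibling_preferences(kid))
    return prefs


def belief_delta_rewards(beliefs: List[float]) -> List[float]:
    """Per-turn reward as the log-ratio of successive beliefs (all > 0)."""
    return [math.log(after / before) for before, after in zip(beliefs, beliefs[1:])]


# Branch "a" always ends correct; branch "b" ends correct only half the time.
root = Node("root", [
    Node("a", [Node("a1", outcome=1.0), Node("a2", outcome=1.0)]),
    Node("b", [Node("b1", outcome=1.0), Node("b2", outcome=0.0)]),
])
prefs = sibling_preferences(root)

# Hypothetical probability the agent assigns to the true answer after each turn.
rewards = belief_delta_rewards([0.1, 0.2, 0.2, 0.8])
```

Only leaf outcomes are supplied, yet `prefs` contains a step-level preference at each branch point where siblings differ. The belief rewards telescope: their sum equals the log-ratio of the final to the initial belief, so a turn that leaves the belief unchanged earns exactly zero.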

What the reader might not expect is that the corpus also pushes back on the premise that finer-grained step rewards are always better: several notes change the *shape* of the outcome reward instead of its granularity. Binary outcome rewards don't just miss steps, they incentivize confident wrong answers, because a confident miss and a hesitant miss score identically; adding a proper scoring rule (the Brier score) as a second reward term fixes calibration without hurting accuracy (Does binary reward training hurt model calibration?). Ternary rewards make 'I don't know' a learnable third option and cut hallucinations (Can three-way rewards fix the accuracy versus abstention problem?). Successes and failures also need not be treated symmetrically: negative reinforcement alone, which only suppresses wrong trajectories, can often match full RL while preserving diversity, suggesting that *which* errors you penalize matters as much as *where* in the trajectory you penalize them (Does negative reinforcement alone outperform full reinforcement learning?; Should successful and failed episodes be processed differently?).
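The three reward shapes discussed above can be compared side by side. This is a hedged sketch: the function names are hypothetical, the Brier-augmented form (correctness minus squared confidence error) is one plausible way to add a Brier term, and the abstention value of 0.0 is an assumption, since the source only says abstention sits between +1 and -1.

```python
from typing import Optional


def binary_reward(correct: bool) -> float:
    """Plain outcome reward: stated confidence plays no role."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def ternary_reward(correct: Optional[bool], abstain_reward: float = 0.0) -> float:
    """+1 correct, -1 wrong, intermediate value when the model abstains (None)."""
    if correct is None:
        return abstain_reward
    return 1.0 if correct else -1.0


# A confident miss and a hesitant miss are indistinguishable under the binary reward...
confident_miss = brier_augmented_reward(False, confidence=0.9)
hesitant_miss = brier_augmented_reward(False, confidence=0.1)
# ...but the Brier term penalizes the confident one more heavily.
print(binary_reward(False), confident_miss, hesitant_miss)

# Under the ternary scheme, abstaining beats a wrong answer but loses to a right one.
print(ternary_reward(True), ternary_reward(None), ternary_reward(False))
```

Under the binary reward, guessing with full confidence is never worse than hedging; once the squared-error term or an abstention option is present, overconfident misses become strictly more costly than admitting uncertainty.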

The through-line: outcome-only signals don't merely *miss* step errors, they *mislabel* good intermediate work, and the frontier isn't 'annotate every step' but 'extract step-level structure for free': from tree branching, from belief dynamics, and from rubrics used as gates rather than dense rewards (Can rubrics and dense rewards work together without hacking?).


Sources (12 notes)

Why do outcome-based reward models fail at intermediate step evaluation?

ORMs systematically underestimate intermediate steps due to training only on final outcomes, producing high false-negative rates. PRMs solve this with step-level feedback but demand costly skilled annotation, revealing a core trade-off in reward model design.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
