How do outcome and process rewards differ in their treatment of intermediate steps?
This explores the core design split in reward modeling: outcome rewards only score the final answer, while process rewards score each intermediate step — and what that difference costs and buys you.
Outcome rewards judge a reasoning trace only by its final answer, while process rewards judge each intermediate step along the way. The cleanest statement of the trade-off is that outcome-based reward models are *systematically pessimistic* about intermediate steps: because they only ever see whether the end result was right, they tend to mark perfectly good middle steps as failures whenever the final answer happens to be wrong, producing high false-negative rates [Why do outcome-based reward models fail at intermediate step evaluation?]. Process reward models fix this by giving step-level feedback, but the classic catch is cost: someone has to annotate which steps are good, and skilled annotation is expensive [Why do outcome-based reward models fail at intermediate step evaluation?].
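The false-negative effect can be seen in a toy example. The sketch below is an illustration only: real outcome reward models are learned scorers, and here we only mimic the supervision signal they are trained on. The names `Step`, `outcome_labels`, `process_labels`, and `false_negative_rate`, along with the example trace, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    is_correct: bool  # hypothetical ground-truth step label (what an annotator would provide)


def outcome_labels(steps, final_answer_correct):
    """Outcome-style supervision: every step inherits the final verdict."""
    return [final_answer_correct] * len(steps)


def process_labels(steps):
    """Process-style supervision: each step carries its own label."""
    return [s.is_correct for s in steps]


def false_negative_rate(predicted, truth):
    """Share of truly good steps that were labeled bad."""
    on_good_steps = [p for p, t in zip(predicted, truth) if t]
    if not on_good_steps:
        return 0.0
    return sum(1 for p in on_good_steps if not p) / len(on_good_steps)


# A trace with three sound steps and one slip at the very end.
trace = [
    Step("Let x be the number of apples.", True),
    Step("Then 3x + 2 = 14.", True),
    Step("So 3x = 12.", True),
    Step("So x = 3.", False),  # arithmetic slip on the last step
]

truth = process_labels(trace)
orm_view = outcome_labels(trace, final_answer_correct=False)
fnr = false_negative_rate(orm_view, truth)
print(f"false-negative rate under outcome-only labels: {fnr:.2f}")
```

In this toy trace all three sound steps are marked as failures because the final answer is wrong, which is the pessimism described above in its most extreme form.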
Why does scoring steps matter so much? Because most failures in long reasoning traces are not wrong final answers; they are process violations partway through. One striking result: adding intermediate verification of states and policy compliance during generation lifted task success from 32% to 87%, precisely because final-answer scoring is blind to where the agent actually went off the rails [Where do reasoning agents actually fail during long traces?]. Outcome reward also fails silently when *every* rollout fails, because there is no signal to learn from. Step-wise expert-similarity rewards give a dense signal even then, which is what lets small models learn hard reasoning that sparse outcome-only RLVR can't teach [Can step-wise expert rewards help small models learn hard reasoning?]. Concretely, on agentic retrieval, supervising the intermediate retrieval steps substantially beats rewarding only the final answer [Does supervising retrieval steps outperform final answer rewards?].
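To make the "dense signal even when every rollout fails" point concrete, here is a minimal sketch. It assumes string similarity from the standard-library `difflib.SequenceMatcher` as a stand-in for expert alignment; the similarity measure actually used by the cited method is not specified here, and the function names and example data are hypothetical.

```python
from difflib import SequenceMatcher


def outcome_reward(final_answer, gold_answer):
    """Sparse outcome reward: 1.0 only for an exact final-answer match."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0


def stepwise_rewards(rollout_steps, expert_steps):
    """Dense reward: similarity of each step to the expert step at the same position.

    Assumption: positional alignment and character-level similarity are
    placeholders chosen for illustration, not the method's actual metric.
    """
    rewards = []
    for i, step in enumerate(rollout_steps):
        if i < len(expert_steps):
            rewards.append(SequenceMatcher(None, step, expert_steps[i]).ratio())
        else:
            rewards.append(0.0)  # steps beyond the expert trace earn nothing
    return rewards


expert = ["3x + 2 = 14", "3x = 12", "x = 4"]
rollouts = [
    ["3x + 2 = 14", "3x = 12", "x = 3"],  # right until the last step
    ["3x + 2 = 14", "3x = 16", "x = 5"],  # goes wrong at the second step
]

sparse = [outcome_reward(r[-1], expert[-1]) for r in rollouts]
dense = [stepwise_rewards(r, expert) for r in rollouts]

print("outcome rewards:", sparse)
for step_rewards in dense:
    print("step rewards:", [round(x, 2) for x in step_rewards])
```

Both rollouts get an outcome reward of zero, so an outcome-only learner cannot tell them apart, while the step-wise rewards still credit the steps that match the expert.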
The most interesting recent move, though, is dissolving the dichotomy: getting step-level signal *without* paying for step-level annotation. Several methods derive process supervision from the structure of the trajectory itself: tree-search rollouts compare sibling subtrees to turn a single outcome reward into step-wise preferences automatically [Can tree structure alone convert outcome rewards into process supervision?], and more broadly, tree topology, expert-aligned actions, and tool-call positions can each substitute for a separately trained process reward model [Can trajectory structure replace hand-annotated process rewards?]. A different route assigns the full episode's cumulative reward back to each step and lets group-relative normalization across rollouts surface which step-sequences actually mattered. It is still outcome reward, but with credit pushed down to the steps [Can full episode rewards per step enable better credit assignment?].
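The last route can be sketched in a few lines. This is a generic illustration of group-relative normalization with the episode reward broadcast to every step, under the assumption of a mean/standard-deviation normalizer; it is not the exact formulation of any cited method, and `group_relative_step_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_step_advantages(episode_rewards, steps_per_episode, eps=1e-8):
    """Broadcast each episode's normalized reward to all of its steps.

    episode_rewards: cumulative reward of each rollout in the group.
    steps_per_episode: number of steps in each rollout.
    Returns one list of per-step advantages per rollout.
    """
    mu = mean(episode_rewards)
    sigma = pstdev(episode_rewards)  # population std over the group
    advantages = []
    for reward, n_steps in zip(episode_rewards, steps_per_episode):
        adv = (reward - mu) / (sigma + eps)
        advantages.append([adv] * n_steps)  # every step shares the episode's credit
    return advantages


# Four rollouts of the same task: two succeed, two fail.
group = group_relative_step_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 3])
for i, step_advs in enumerate(group):
    print(i, [round(a, 3) for a in step_advs])

# If every rollout gets the same reward, all advantages collapse to zero.
all_fail = group_relative_step_advantages([0.0, 0.0, 0.0], [2, 2, 2])
```

Steps from rollouts that beat the group average receive positive credit and the rest receive negative credit. The `all_fail` case shows why a group with identical rewards carries no learning signal under this scheme.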
There's also a quieter shift in *what* a step reward should even be. Generative judges that reason about a reasoning step, rather than classify it as good/bad, turn out to be both more accurate and far more data-efficient, undercutting the old assumption that process supervision must be a costly labeling exercise [Can judges that reason about reasoning outperform classifier rewards?]. And process models built for polished answers break on real thinking traces, which branch, backtrack, and revisit; trajectory-aware PRMs have to treat a failed step as informative exploration rather than an error [Why do standard process reward models fail on thinking traces?].
The thing worth taking away: the outcome-vs-process line isn't really about *where* you put the reward, it's about how much information you're willing to throw away. A scalar outcome reward collapses a rich trajectory into a single number (one bit, when the outcome is pass/fail), and you can recover a surprising amount of the discarded structure, such as evaluative *and* directive information [Can scalar rewards capture all the information in agent feedback?] or asymmetric handling of wins versus failures [Should successful and failed episodes be processed differently?], without ever hand-labeling a single step.
Sources (11 notes)
ORMs systematically underestimate intermediate steps due to training only on final outcomes, producing high false-negative rates. PRMs solve this with step-level feedback but demand costly skilled annotation, revealing a core trade-off in reward model design.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Standard PRMs degrade on trajectory format because thinking traces include branching, backtracking, and weaker coherence than polished responses. ReasonFlux-PRM addresses this by supervising both trajectories and responses, treating failed steps as informative exploration rather than errors.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.