INQUIRING LINE

How does process-focused feedback compare to outcome-focused feedback in skill training?

This explores whether giving a model feedback on *how* it reasoned (step-by-step process) trains skills better than rewarding it only on *whether the final answer was right* (outcome) — and what the tradeoffs are.


The corpus comes down fairly clearly on one side: supervising the process usually wins. In agentic retrieval, scoring each intermediate retrieval step rather than just the final answer produces a substantial performance jump, especially when good and bad retrieval chains are contrasted directly (Does supervising retrieval steps outperform final answer rewards?). The intuition is that an outcome reward is information-starved: it tells the model it failed but not *where* or *why*. Supplying that missing 'why' is what lets models break through plateaus that purely numerical reward cannot (Can natural language feedback overcome numerical reward plateaus?).

But the interesting turn in this collection isn't 'process beats outcome'; it's that the line between them is blurrier than it looks. Several papers show you can manufacture process-like supervision *out of* outcome signals, dodging the expensive part (human step-by-step annotation). Reverse-curriculum RL (R3) slides the reasoning start point backward from near-completion, so plain outcome feedback ends up revealing step-level failure modes without any step annotations (Can curriculum learning approximate expensive process supervision?). Tree-search rollouts (Tree-GRPO) do something structurally similar: by comparing sibling branches, they convert trajectory-level rewards into step-wise preference signals with no separate process-reward model at all (Can tree structure alone convert outcome rewards into process supervision?). So 'process vs outcome' is partly a question of whether you pay for granularity up front or engineer it after the fact.
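The sibling-comparison idea can be made concrete with a toy rollout tree. The sketch below is a minimal illustration, not the Tree-GRPO implementation: the `Node` class, the `subtree_value` and `sibling_preferences` helpers, and the averaging rule for scoring a branch are all assumptions introduced here. Only leaves carry an outcome reward, yet every branching point still yields a step-level preference.

```python
from dataclasses import dataclass, field
from itertools import combinations
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One step in a rollout tree (hypothetical structure for illustration)."""
    step: str
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None  # final-answer reward, set on leaves only


def subtree_value(node: Node) -> float:
    """Score a branch as the mean outcome reward of the leaves beneath it."""
    if not node.children:
        if node.outcome is None:
            raise ValueError(f"leaf {node.step!r} has no outcome reward")
        return node.outcome
    return mean(subtree_value(child) for child in node.children)


Preference = Tuple[Tuple[str, ...], str, str]  # (shared prefix, better step, worse step)


def sibling_preferences(node: Node, prefix: Tuple[str, ...] = ()) -> List[Preference]:
    """Turn leaf-only rewards into step-level preferences between siblings."""
    path = prefix + (node.step,)
    prefs: List[Preference] = []
    for a, b in combinations(node.children, 2):
        value_a, value_b = subtree_value(a), subtree_value(b)
        if value_a == value_b:
            continue  # a tie carries no preference signal
        better, worse = (a, b) if value_a > value_b else (b, a)
        prefs.append((path, better.step, worse.step))
    for child in node.children:
        prefs.extend(sibling_preferences(child, path))
    return prefs


# Toy tree: two retrieval choices, one of which branches into two answers.
tree = Node("question", [
    Node("retrieve: doc A", [
        Node("answer: Paris", outcome=1.0),
        Node("answer: Lyon", outcome=0.0),
    ]),
    Node("retrieve: doc B", [
        Node("answer: Arles", outcome=0.0),
    ]),
])

prefs = sibling_preferences(tree)
for shared_prefix, better_step, worse_step in prefs:
    print(" > ".join(shared_prefix), "|", better_step, "preferred over", worse_step)
```

Each preference is tied to a shared prefix, so it says which *next step* was better from the same state. That is the step-level signal a process reward model would otherwise have to supply, and more branching (more compute) yields more such comparisons.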

Why does the granularity matter for *skill* training specifically? Because a scalar score throws away a whole dimension of the feedback. One paper makes this explicit: feedback carries two things, evaluative information ('how good was that') and directive information ('change it this way'), and a single reward number can only hold the first (Can scalar rewards capture all the information in agent feedback?). Rich, tokenized environment feedback can be turned into a dense, per-token learning signal, effectively letting the policy act as its own step-level critic (Can environment feedback replace scalar rewards in policy learning?). Process feedback isn't just 'more frequent reward'; it's a *different kind* of information that outcome reward structurally cannot encode.

There's also a quieter benefit that outcome reward doesn't deliver: process feedback keeps training healthy, not just accurate. Inserting step-level critique into the training loop preserves solution diversity and counteracts 'tail narrowing', where a model prematurely collapses onto one strategy (Do critique models improve diversity during training itself?). A related asymmetry helps too: treating successful trajectories as concrete demonstrations while distilling failures into abstract lessons outperforms processing every episode the same way (Should successful and failed episodes be processed differently?). The lesson living inside a failure is process information; the outcome label alone would discard it.

The thing you might not have expected to learn: the richest version of process feedback isn't a number at all, it's language. Chain-of-thought critiques, solicited corrective dialogue, and natural-language explanations of *why* something went wrong can get models past points where scalar rewards stall (Can natural language feedback overcome numerical reward plateaus?; Can LLMs learn to ask for feedback during problem solving?). So the real frontier isn't 'process vs outcome'; it's how cheaply you can recover the directive, language-shaped signal that outcome rewards leave on the floor.


Sources (9 notes)

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can LLMs learn to ask for feedback during problem solving?

Research shows that reformulating static tasks as pedagogical dialogues—where a teacher has privileged information and the student must learn to extract it—trains models to actively engage conversation as a problem-solving tool, not just imitate dialogue patterns.
