INQUIRING LINE

What information-theoretic framework explains why process rewards beat outcome only?

This explores whether the corpus offers an information-theoretic reason — not just empirical results — for why step-by-step (process) rewards outperform single end-of-task (outcome) rewards.


The sharpest answer the corpus gives is this: a scalar outcome reward is a low-bandwidth channel, and process supervision works by recovering the information that channel throws away. The clearest statement is the finding that natural feedback splits into two orthogonal kinds of information, *evaluative* (how good was that action) and *directive* (how should it change), and that a scalar reward can carry the first but structurally cannot carry the second (see "Can scalar rewards capture all the information in agent feedback?"). Outcome-only training collapses an entire trajectory onto one number; process rewards preserve where along the path the signal actually lives.
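One way to make the low-bandwidth claim concrete is a standard entropy bound. This is an illustration, not a derivation taken from the notes. It assumes the outcome reward R is a deterministic binary function of the trajectory τ, and that a process scheme emits T binary step rewards r_1, ..., r_T; both symbols and the binary assumption are introduced here for the sketch.

```latex
% R: binary outcome reward, a deterministic function of the trajectory \tau (assumption)
% r_1, ..., r_T: binary step-level rewards, also functions of \tau (assumption)
\begin{align*}
I(\tau; R) &= H(R) - H(R \mid \tau) = H(R) \le \log_2 2 = 1 \text{ bit}, \\
I(\tau; r_1, \dots, r_T) &= H(r_1, \dots, r_T) \le T \text{ bits}.
\end{align*}
```

These are upper bounds only: correlated or noisy step rewards carry fewer than T bits, and neither bound says anything about directive information, which a scalar of any resolution does not express.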

Several notes show the same gap from the failure side. When models hit a performance plateau that more numerical reward can't break, the missing ingredient turns out to be exactly the *why-it-failed* and *how-to-fix-it* information that a correct/incorrect scalar never encodes: supply it as a natural-language critique and the plateau breaks (see "Can natural language feedback overcome numerical reward plateaus?"). That's an information-theoretic claim in disguise: the reward signal was saturated, so adding more of it added no bits. Calibration work makes the same point formally: a binary correctness reward provably cannot distinguish a confident wrong answer from a hesitant one, because the model's stated confidence never enters the reward, and you have to add a second reward term (a proper scoring rule) to recover it (see "Does binary reward training hurt model calibration?"). Ternary rewards make the same move on a different axis, carving out abstention as its own representable state instead of folding it into 'wrong' (see "Can three-way rewards fix the accuracy versus abstention problem?").
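The calibration point can be seen in a few lines. The sketch below is a minimal illustration, assuming the second reward term is a Brier penalty subtracted from the correctness indicator; the function names and the exact combination are hypothetical and may differ from the formulation in the cited note.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Outcome-only reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (squared error of the confidence).

    Assumed form for illustration: r = y - (q - y)^2, with y in {0, 1}.
    """
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


cases = {
    "confident wrong": (False, 0.95),
    "hesitant wrong": (False, 0.55),
    "confident right": (True, 0.95),
}
for name, (correct, q) in cases.items():
    print(name, binary_reward(correct, q), round(calibrated_reward(correct, q), 4))
```

Under the binary reward both wrong answers score 0.0, so no gradient can separate them; with the Brier term the confident wrong answer is penalized more heavily than the hesitant one.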

The interesting twist, and the thing you might not expect, is that 'process beats outcome' doesn't always require a separately trained process reward model. A whole cluster of recent work shows the missing per-step information can be *extracted* from structure already present in the rollouts. Tree-search branching converts trajectory-level outcome rewards into step-level preference signals by comparing sibling subtrees (see "Can tree structure alone convert outcome rewards into process supervision?"); more generally, tree topology, expert-aligned actions, and tool-call positions each serve as free sources of dense credit (see "Can trajectory structure replace hand-annotated process rewards?"). Even more strikingly, an agent's own shifting belief about the answer, measured as the log-ratio of its sequential probability estimates, supplies dense per-turn credit with no critic and no annotation at all (see "Can an agent's own beliefs guide credit assignment without critics?"). The information was latent in the trajectory the whole time; outcome-only reward just declined to read it.
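Two of these extraction ideas are simple enough to sketch. The code below is a toy illustration under stated assumptions, not the implementation from either cited method: `sibling_advantages` assumes a branch's value is the mean outcome reward of its leaves and scores each sibling against the sibling average, and `belief_credit` assumes the probability being tracked is the agent's estimate for the correct answer after each turn. All names are hypothetical.

```python
import math
from statistics import mean
from typing import List, Union

# A leaf is an outcome reward; a list is a branch point with child subtrees.
Tree = Union[float, List["Tree"]]


def leaves(tree: Tree) -> List[float]:
    """Collect every outcome reward found under this node."""
    if isinstance(tree, (int, float)):
        return [float(tree)]
    return [r for child in tree for r in leaves(child)]


def sibling_advantages(branch: List[Tree]) -> List[float]:
    """Score each child subtree by its mean leaf reward minus the sibling mean."""
    values = [mean(leaves(child)) for child in branch]
    baseline = mean(values)
    return [v - baseline for v in values]


def belief_credit(probs: List[float]) -> List[float]:
    """Per-turn credit as the log-ratio of consecutive belief estimates.

    probs[0] is the belief before the first turn; probs[t] is the belief after turn t.
    """
    return [math.log(after / before) for before, after in zip(probs, probs[1:])]


# Three sibling steps whose rollouts end in outcome rewards only.
print(sibling_advantages([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]))  # [0.5, 0.0, -0.5]

# Belief in the correct answer over four turns; the second turn loses ground.
credits = belief_credit([0.05, 0.10, 0.08, 0.40, 0.90])
print([round(c, 3) for c in credits])
```

The log-ratio credits telescope: their sum equals the log of the final belief over the initial belief, so the per-turn signal redistributes the total belief gain across turns rather than adding reward from outside.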

There's a quality caveat worth carrying. Denser isn't automatically better: converting rich rubric judgments into dense token rewards invites reward hacking, whereas using those same rubrics as accept/reject *gates* preserves their signal without it (see "Can rubrics and dense rewards work together without hacking?"). And the bits matter more than the count: generative judges that reason about a reasoning step outperform classifier-style reward models with orders of magnitude less data (see "Can judges that reason about reasoning outperform classifier rewards?"), while causal reward modeling shows that some of the 'information' in a naive reward is actually spurious correlation that has to be filtered out (see "Can counterfactual invariance eliminate reward hacking biases?"). So the framework the corpus points to isn't 'more signal wins'. It's that an outcome scalar is a lossy compression of feedback, process methods recover the discarded directive and positional information, and the real engineering question is recovering the *right* bits without manufacturing fake ones.
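The gate-versus-dense distinction can be sketched as follows. This is a hypothetical illustration of the general pattern, not the procedure from the cited note: it assumes a rubric check that accepts or rejects a whole rollout group, and a separate dense reward that is mean-centered only inside accepted groups. The function name, the callbacks, and the toy reward are all invented for the example.

```python
from statistics import mean
from typing import Callable, List, Sequence


def gated_group_advantages(
    groups: Sequence[Sequence[str]],
    rubric_accepts: Callable[[Sequence[str]], bool],
    dense_reward: Callable[[str], float],
) -> List[List[float]]:
    """Return mean-centered dense rewards for accepted groups, [] for rejected ones.

    The rubric never contributes a score; it only decides whether a group
    is allowed to produce a training signal at all.
    """
    advantages: List[List[float]] = []
    for group in groups:
        if not rubric_accepts(group):
            advantages.append([])  # rejected group: no gradient signal
            continue
        rewards = [dense_reward(rollout) for rollout in group]
        baseline = mean(rewards)
        advantages.append([r - baseline for r in rewards])
    return advantages


# Toy stand-ins: the rubric rejects any group containing "bad";
# the dense reward counts occurrences of "ok".
groups = [["ok ok", "ok"], ["bad", "bad ok"]]
result = gated_group_advantages(
    groups,
    rubric_accepts=lambda g: all("bad" not in r for r in g),
    dense_reward=lambda r: float(r.count("ok")),
)
print(result)  # [[0.5, -0.5], []]
```

Because the rubric is never converted into a number to maximize, there is no rubric score for the policy to inflate; the dense reward only ranks rollouts that have already passed the gate.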


Sources (10 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.
