Can scalar rewards capture all the information in agent feedback?
Exploring whether numerical rewards alone can preserve both the evaluative judgment and the directional guidance embedded in natural feedback, or whether something crucial is lost in the conversion.
The OpenClaw-RL framework formalizes a decomposition that prior agentic RL work left implicit: when an agent acts and the environment responds, the response carries two distinct kinds of information. The evaluative signal scores the action (how well did it perform?) and can be extracted as a scalar reward via a PRM judge. The directive signal specifies how the action should have been different: not just that it was wrong, but in what direction. These are orthogonal: high-quality directive information can accompany any evaluation, and scalar rewards systematically lose the directive component.
Consider a user who says "you should have checked the file first." The evaluative content is approximately -1 (the response was inadequate). But the directive content is specific down to the token level: check the file first. A PRM judge can convert the sentiment into a scalar, but the sequence-level correction vanishes into a single number. Similarly, a detailed SWE error trace often implies a concrete correction direction that scalar outcome rewards cannot convey. Current RLVR methods operate on scalar rewards (Does RLVR actually expand what models can reason about?) and cannot convert directive information into a directional policy gradient. Distillation methods can process structured corrections, but they require pre-curated feedback-response pairs rather than live signals.
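To make the two projections concrete, here is a minimal Python sketch; the `NextStateSignal` container, the `prm_judge` callable, and the `decompose` helper are illustrative names for this note, not part of the OpenClaw-RL API.

```python
from dataclasses import dataclass

# Hypothetical container for the two projections of one piece of feedback.
@dataclass
class NextStateSignal:
    evaluative: float   # scalar score of the action (what a PRM judge returns)
    directive: str      # textual hint: how the action should have differed

def decompose(feedback: str, prm_judge) -> NextStateSignal:
    """Project one piece of natural feedback onto both axes.

    The scalar keeps only "how good was it"; the directive text keeps
    "what to do instead" and is what OPD later injects into the teacher
    context. Collapsing to the scalar alone discards the second axis.
    """
    return NextStateSignal(
        evaluative=prm_judge(feedback),  # e.g. maps to {-1, +1} or [0, 1]
        directive=feedback,              # kept verbatim for hint injection
    )

# "you should have checked the file first"
#   -> evaluative ~ -1, directive: "check the file first"
```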
OpenClaw-RL recovers the directive signal through Hindsight-Guided On-Policy Distillation (OPD): extract textual hints from the next state, construct an enhanced teacher context by injecting those hints, and distill token-level directional advantage back into the student policy. This is richer than any scalar reward because it teaches the model not just "that was wrong" but "here is what right looks like in these specific tokens." The empirical result, that combining binary PRM-based RL with OPD via a weighted loss yields significant gains over either alone, confirms the two signals are complementary rather than redundant.
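A rough sketch of how the weighted combination might look, assuming a PyTorch / Hugging Face-style causal-LM interface; the function name, the `opd_weight` knob, the plain REINFORCE term, and the forward-KL distillation loss are assumptions made for illustration, not the framework's published implementation.

```python
import torch
import torch.nn.functional as F

def openclaw_rl_step(student, teacher, prompt_ids, response_ids,
                     hint_ids, prm_reward, opd_weight=0.5):
    """One combined update: binary PRM-based RL plus hindsight-guided OPD.

    prm_reward : scalar from the PRM judge (evaluative signal)
    hint_ids   : tokenized textual hint extracted from the next state
                 (directive signal), injected only into the teacher context
    """
    resp_len = response_ids.size(-1)

    # Student sees the original context only. Logits at position t predict
    # token t+1, so the logits scoring the response start one step earlier.
    student_logits = student(torch.cat([prompt_ids, response_ids], dim=-1)).logits
    student_logp = F.log_softmax(student_logits[:, -resp_len - 1:-1], dim=-1)

    # Evaluative path: REINFORCE-style loss on the scalar PRM reward
    # (baselines / advantage normalization omitted for brevity).
    token_logp = student_logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    rl_loss = -(prm_reward * token_logp).mean()

    # Directive path: the teacher conditions on the hint-enhanced context, so
    # its per-token distribution encodes "what right looks like" at each position.
    with torch.no_grad():
        teacher_logits = teacher(
            torch.cat([prompt_ids, hint_ids, response_ids], dim=-1)).logits
        teacher_logp = F.log_softmax(teacher_logits[:, -resp_len - 1:-1], dim=-1)

    # Token-level KL from the hint-conditioned teacher to the student, which
    # never sees the hint directly: the directional advantage distilled back
    # into the student policy.
    opd_loss = F.kl_div(student_logp, teacher_logp,
                        log_target=True, reduction="batchmean")

    # Weighted combination of the evaluative and directive objectives.
    return rl_loss + opd_weight * opd_loss
```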
This decomposition matters beyond OpenClaw-RL because it clarifies a conceptual muddle in agentic RL. When people debate "should we use outcome rewards or process rewards, scalar or verbal," the answer is usually "both, decomposed properly." The outcome-vs-process trade-off (Why do outcome-based reward models fail at intermediate step evaluation?) assumes a single signal type. The scalar-vs-verbal distinction is treated as architectural (Can natural language feedback overcome numerical reward plateaus?). OpenClaw-RL reframes them as two projections of one signal: evaluative (dense scalar) and directive (token-level).
The generalization: any learning loop that reduces natural feedback to scalars is discarding the fraction of training signal that most resembles supervised learning. A corrective sentence contains its own teacher.
Source: Autonomous Agents
Related concepts in this collection
- Can agent deployment itself generate training signals automatically?
  Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
  (the framing this decomposition operates within)
- Can natural language feedback overcome numerical reward plateaus?
  Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
  (establishes that verbal feedback contains information scalars cannot reach)
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  (another case where single-scalar objectives miss structure)
- Why do outcome-based reward models fail at intermediate step evaluation?
  Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
  (the outcome/process axis is the wrong cut; evaluative/directive is closer to the information structure)
- Does critiquing errors teach deeper understanding than imitating correct answers?
  Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
  (critique-based training as a cousin: teaching the model the directive structure behind errors)
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.
  (scalar RLVR's structural ceiling that directive signals may penetrate)
Original note title: agent next-state signals decompose into evaluative and directive information that scalar rewards cannot jointly capture