Reinforcement Learning for LLMs

Can scalar rewards capture all the information in agent feedback?

Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.

Note · 2026-04-07 · sourced from Autonomous Agents
How should we allocate compute budget at inference time? How does reinforcement learning reshape what models can reason about?

The OpenClaw-RL framework formalizes a decomposition that was only implicit in prior agentic RL work: when an agent acts and the environment responds, the response carries two distinct kinds of information. The evaluative signal scores the action (how well did it perform?) and can be extracted as a scalar reward via a PRM judge. The directive signal specifies how the action should have been different: not just that it was wrong, but in what direction. These are orthogonal: high-quality directive information can accompany any evaluation, and scalar rewards systematically lose the directive component.
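A minimal sketch of the decomposition as a data structure. Everything here is illustrative, not OpenClaw-RL's actual API: `FeedbackSignal`, `decompose`, and the crude keyword heuristic are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FeedbackSignal:
    evaluative: float          # scalar score of the action (the PRM-judge projection)
    directive: Optional[str]   # textual correction direction, if the feedback has one

# Crude marker heuristic, purely for illustration.
CORRECTIVE_MARKERS = ("should have", "instead", "you need to")

def decompose(env_response: str,
              prm_judge: Callable[[str], float]) -> FeedbackSignal:
    """Project one environment response onto its two orthogonal components."""
    scalar = prm_judge(env_response)  # evaluative: collapses the text to one number
    # directive: keep the raw text whenever it carries a correction direction
    hint = env_response if any(m in env_response for m in CORRECTIVE_MARKERS) else None
    return FeedbackSignal(evaluative=scalar, directive=hint)
```

On feedback like "you should have checked the file first," the evaluative projection is roughly -1 while the directive projection retains the whole corrective sentence, which is exactly the worked example below.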

Consider a user who says "you should have checked the file first." The evaluative content is approximately -1 (the response was inadequate). But the directive content is specific down to the token level: check the file first. A PRM judge can convert the sentiment into a scalar, but the sequence-level correction vanishes into a single number. Similarly, a detailed SWE error trace often implies a concrete correction direction that scalar outcome rewards cannot convey. Current RLVR methods operate on scalar rewards (Does RLVR actually expand what models can reason about?) and cannot convert directive information into a directional policy gradient. Distillation methods can process structured corrections but require pre-curated feedback-response pairs rather than live signals.
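To see why the scalar path is lossy, consider a plain REINFORCE-style loss, a hedged sketch rather than any specific RLVR implementation. The scalar advantage multiplies every token's log-probability identically, so the gradient can only say "less of this sequence," never which tokens to change or toward what.

```python
import torch

def scalar_pg_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """logprobs: (T,) log-probs of the sampled response tokens under the policy."""
    advantage = reward  # e.g. -1.0 from the PRM judge; no baseline, for brevity
    # Every token receives the same scalar weight: the gradient says "make this
    # whole sequence less likely" but nothing about *how* it should have differed.
    return -(advantage * logprobs).sum()
```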

OpenClaw-RL recovers the directive signal through Hindsight-Guided On-Policy Distillation (OPD): extract textual hints from the next state, construct an enhanced teacher context by injecting those hints, and distill token-level directional advantage back into the student policy. This is richer than any scalar reward because it teaches the model not just "that was wrong" but "here is what right looks like in these specific tokens." The empirical result, that combining binary PRM-based RL with OPD via a weighted loss yields significant gains over either alone, confirms the two signals are complementary, not redundant.
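A rough sketch of one combined update under stated assumptions: `student` and `teacher` are hypothetical callables returning per-token log-probabilities over the vocabulary, `extract_hint` stands in for whatever parser pulls the correction out of the next state, and the bracketed hint format and weight `lam` are illustrative choices, not OpenClaw-RL's exact recipe.

```python
import torch

def extract_hint(next_state: str) -> str:
    # Placeholder: in practice a parser or judge model would pull the
    # correction out of the raw next-state text. Hypothetical helper.
    return next_state

def opd_step(student, teacher, context, response_ids, next_state, prm_judge,
             lam=0.5):
    # 1. Hindsight: extract a textual hint from the next state.
    hint = extract_hint(next_state)
    # 2. Enhanced teacher context: same prompt plus the injected hint.
    teacher_ctx = context + f"\n[hint: {hint}]"
    with torch.no_grad():
        t_logprobs = teacher(teacher_ctx, response_ids)   # (T, vocab)
    s_logprobs = student(context, response_ids)           # (T, vocab)

    # 3. Directive signal: per-token reverse KL(student || hint-conditioned
    #    teacher), computed on the student's own on-policy rollout.
    s_probs = s_logprobs.exp()
    opd_loss = (s_probs * (s_logprobs - t_logprobs)).sum(-1).mean()

    # 4. Evaluative signal: binary PRM reward drives a scalar PG loss.
    reward = prm_judge(next_state)                        # e.g. +1.0 / -1.0
    token_lp = s_logprobs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    rl_loss = -(reward * token_lp).sum()

    # 5. Weighted combination of the two complementary losses.
    return rl_loss + lam * opd_loss
```

The point of the combination is visible in the shapes: the RL term applies one number to the whole rollout, while the distillation term supplies a full distribution at every token position, which is where the "directional advantage" lives.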

This decomposition matters beyond OpenClaw-RL because it clarifies a conceptual muddle in agentic RL. When people debate "should we use outcome rewards or process rewards, scalar or verbal," the answer is usually "both, decomposed properly." The outcome-vs-process trade-off (Why do outcome-based reward models fail at intermediate step evaluation?) assumes a single signal type. The scalar-vs-verbal distinction is treated as architectural (Can natural language feedback overcome numerical reward plateaus?). OpenClaw-RL reframes them as two projections of one signal: evaluative (dense scalar) and directive (token-level).

The generalization: any learning loop that reduces natural feedback to scalars is discarding the fraction of training signal that most resembles supervised learning. A corrective sentence contains its own teacher.


Source: Autonomous Agents

