Can an agent's own beliefs guide credit assignment without critics?
Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.
Long-horizon RL suffers from sparse trajectory-level rewards. The standard fixes — process reward models trained on step-level annotations, external verifiers, LLM-as-judge — all require additional supervision infrastructure. PRMs need expensive step-level labels. Verifiers exist only for verifiable domains (math, code). Judges introduce their own reward-modeling biases.
ΔBelief-RL (2602.12342) finds the credit signal inside the agent itself. At each interaction step, compute the agent's current probability assigned to the target solution. Compare it to the probability before the interaction. The log-ratio of sequential beliefs is the ΔBelief reward — a dense, turn-level signal that reinforces actions which shift the agent's internal world view toward the correct solution. Actions that increase belief in the target get rewarded; actions that don't, don't.
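The log-ratio described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the beliefs are already available as log-probabilities of the target, with index 0 taken before any interaction. The function name `delta_belief_rewards` and the input layout are hypothetical.

```python
from typing import List


def delta_belief_rewards(target_logprobs: List[float]) -> List[float]:
    """Per-turn log-ratio of sequential beliefs.

    target_logprobs[t] is log p_t(target): the agent's log-probability of the
    correct solution after turn t, with index 0 taken before any interaction.
    (This layout is an assumption made for the sketch.)
    """
    if len(target_logprobs) < 2:
        return []
    # log(p_t / p_{t-1}) = log p_t - log p_{t-1}
    return [
        after - before
        for before, after in zip(target_logprobs[:-1], target_logprobs[1:])
    ]


# Hypothetical beliefs over three turns: up, down, then up again.
beliefs = [-6.0, -4.5, -5.0, -1.0]
rewards = delta_belief_rewards(beliefs)
print(rewards)  # [1.5, -0.5, 4.0]
```

Because each reward is a difference of log-probabilities, the per-turn rewards sum to the total change in log-belief over the episode (here 5.0). The raw log-ratio is negative when belief in the target drops; whether such turns are penalized or clipped to zero is not specified in this note.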
The elegance is that no separate model is needed. The agent's own log-probabilities on the correct outcome are the value signal. There is no critic to train, no PRM to maintain, no judge to query. The one extra step, measuring log-probabilities on the target, is relatively inexpensive: a single forward pass per turn.
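One way to picture the measurement step is to score the target string under teacher forcing and sum the per-token log-probabilities. The sketch below assumes this is how a "belief" is computed. It also assumes the logits come from a single forward pass over the interaction history followed by the target. Both are illustrative assumptions, and `target_logprob` is a hypothetical helper, not an API from the paper.

```python
import math
from typing import List, Sequence


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def target_logprob(
    logits: Sequence[Sequence[float]], target_ids: Sequence[int]
) -> float:
    """Sum of per-token log-probabilities of the target.

    logits[i] holds the next-token logits at the position that predicts
    target token i; target_ids[i] is that token's id.
    """
    if len(logits) != len(target_ids):
        raise ValueError("need one row of logits per target token")
    return sum(log_softmax(row)[tok] for row, tok in zip(logits, target_ids))


# Toy example: 2 target tokens, vocabulary of 3, uniform logits.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(target_logprob(uniform, [0, 2]))  # 2 * log(1/3), about -2.197
```

Calling this once before and once after a turn yields the two log-probabilities whose difference is the turn's reward.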
Two properties make this work. First, it is general-purpose: it applies to any task where the correct final outcome is available during training (which is most supervised settings). Second, it is robust to over-optimization: PRMs can be exploited because their reward signal is a learned approximation, whereas ΔBelief is grounded in the model's own evolving probability assignment. That is harder to game, because the only way to increase the log-probability of the target is to actually integrate information that supports it.
Empirically, ΔBelief-RL on 20Qs trains CIA models at 1.7B-4B scale that outperform prior SOTA multi-turn methods and even 670B models. Performance generalizes to extended interaction horizons beyond training and to OOD applications (customer service, personalization).
The mechanism aligns with the curiosity reward in "Can conversations themselves personalize without user profiles?": both reward uncertainty reduction. But ΔBelief's signal is about the target's probability specifically, while curiosity reward is about general uncertainty over user type. ΔBelief is information-theoretically tighter: it rewards moves toward the actual answer, not all moves that increase clarity.
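A toy example makes the distinction concrete. The sketch below compares a generic entropy-reduction reward with the target log-ratio on a belief that becomes confident in the wrong candidate. The entropy-reduction function is a stand-in for "rewarding clarity" in general, not the curiosity reward's actual definition, and all numbers are made up for illustration.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropy_reduction(before: Sequence[float], after: Sequence[float]) -> float:
    """Generic 'clarity' reward: how much uncertainty the turn removed."""
    return entropy(before) - entropy(after)


def target_log_ratio(
    before: Sequence[float], after: Sequence[float], target: int
) -> float:
    """Target-specific reward: log-ratio of belief in the correct answer."""
    return math.log(after[target]) - math.log(before[target])


# Four candidate answers; the correct one is index 1.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.85, 0.05, 0.05, 0.05]  # now confident in the wrong candidate
target = 1

print(entropy_reduction(before, after))         # positive: belief got sharper
print(target_log_ratio(before, after, target))  # negative: target lost mass
```

The turn reduces uncertainty, so a clarity-style reward is positive, yet the log-ratio on the target is negative because probability mass moved away from the correct answer.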
The broader implication: in any setting where the ground-truth final outcome is available during training, the model's own probability shift can serve as a dense intrinsic reward. The reward model is not load-bearing.
Paper: Intrinsic Credit Assignment for Long Horizon Interaction
Related concepts in this collection
- Can conversations themselves personalize without user profiles?
  Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?
  both reward uncertainty reduction; ΔBelief is target-specific, curiosity reward is type-general (different information-theoretic targets)
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  L2T uses PAC-Bayes/Fisher; ΔBelief uses log-ratio of sequential beliefs; both convert outcome correctness into dense step-level reward without annotation
- Can environment feedback replace scalar rewards in policy learning?
  Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.
  convergent verifier-free move via different mechanism: SDPO uses feedback-conditioned self-teacher; ΔBelief uses belief-shift on target
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  three independent paths to RL without external preference labels are converging
Original note title
belief-shift toward the target solution is a dense intrinsic reward — log-ratio of sequential beliefs provides per-turn credit without separate critic or PRM