Reasoning and Learning Architectures

Can an agent's own beliefs guide credit assignment without critics?

Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.

Note · 2026-05-18 · sourced from Reinforcement Learning
What actually changes inside a model during RL training? How well do reward models actually evaluate AI reasoning?

Long-horizon RL suffers from sparse trajectory-level rewards. The standard fixes — process reward models trained on step-level annotations, external verifiers, LLM-as-judge — all require additional supervision infrastructure. PRMs need expensive step-level labels. Verifiers exist only for verifiable domains (math, code). Judges introduce their own reward-modeling biases.

ΔBelief-RL (2602.12342) finds the credit signal inside the agent itself. At each interaction step, compute the agent's current probability assigned to the target solution. Compare it to the probability before the interaction. The log-ratio of sequential beliefs is the ΔBelief reward — a dense, turn-level signal that reinforces actions which shift the agent's internal world view toward the correct solution. Actions that increase belief in the target get rewarded; actions that don't, don't.
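
The log-ratio described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `delta_belief_rewards` and the belief trajectory are hypothetical, and it assumes the belief at each turn is already available as a single probability on the target. Whether the paper clips, scales, or normalizes the raw log-ratio is not stated in this note, so the sketch keeps it raw, which means a turn that lowers belief receives a negative value.

```python
import math

def delta_belief_rewards(target_probs):
    """Per-turn log-ratio rewards from a sequence of beliefs in the target.

    target_probs[0] is the belief before any interaction (assumed layout);
    target_probs[t] is the belief after turn t.
    """
    if any(p <= 0.0 or p > 1.0 for p in target_probs):
        raise ValueError("beliefs must be probabilities in (0, 1]")
    return [
        math.log(after) - math.log(before)
        for before, after in zip(target_probs, target_probs[1:])
    ]

# Hypothetical belief trajectory over three turns (illustrative numbers only).
beliefs = [0.01, 0.05, 0.04, 0.40]
rewards = delta_belief_rewards(beliefs)

# The per-turn rewards telescope: their sum equals the total log-belief gain.
total_gain = math.log(beliefs[-1]) - math.log(beliefs[0])
```

One consequence worth noting: because the log-ratios telescope, the dense per-turn signal adds up to the trajectory-level change in log-belief, so it redistributes credit across turns rather than inventing extra reward.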

The elegance is that no separate model is needed. The agent's own log-probabilities on the correct outcome are the value signal. There is no critic to train, no PRM to maintain, no judge to query. The one added step, measuring log-probabilities on the target, is relatively inexpensive: a single forward pass per turn.
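
A rough sketch of that measurement step, under stated assumptions: the forward pass is taken to return one row of vocabulary logits per target token, and the belief is taken to be the summed token log-probability of the target. Both are assumptions for illustration (the paper may define or normalize the belief differently), and `log_softmax` and `target_log_prob` are hypothetical helper names, not an API from any library.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum_exp for x in row]

def target_log_prob(logits, target_ids):
    """Sum of per-token log-probabilities of the target sequence.

    logits: one list of vocabulary scores per target token position.
    target_ids: the target's token ids, same length as logits.
    """
    if len(logits) != len(target_ids):
        raise ValueError("need exactly one logit row per target token")
    return sum(log_softmax(row)[tok] for row, tok in zip(logits, target_ids))

# Hypothetical logits for a one-token target (id 0) over a two-word vocabulary,
# scored once before and once after an interaction turn.
before = target_log_prob([[0.0, 0.0]], [0])
after = target_log_prob([[2.0, 0.0]], [0])
turn_reward = after - before  # log-ratio of sequential beliefs
```

In a real setup the logits would come from scoring the target under the conversation so far, which is where the single forward pass per turn is spent.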

Two properties make this work. First, it is general-purpose: it applies to any task where the correct final outcome is available during training (which covers most supervised settings). Second, it is robust to over-optimization: PRMs can be exploited because their reward signal is a learned approximation, whereas ΔBelief is grounded in the model's own evolving probability assignment. That is harder to game, because the only way to increase the log-probability of the target is to actually integrate information that supports it.

Empirically, ΔBelief-RL on 20Qs trains CIA models at 1.7B-4B scale that outperform prior SOTA multi-turn methods and even 670B models. Performance generalizes to interaction horizons longer than those seen in training and to out-of-distribution (OOD) applications: customer service and personalization.

The mechanism aligns with the note "Can conversations themselves personalize without user profiles?": both reward uncertainty reduction. But ΔBelief's signal is about the target's probability specifically, while the curiosity reward is about general uncertainty over user type. ΔBelief is information-theoretically tighter: it rewards moves toward the actual answer, not all moves that increase clarity.
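
A toy comparison makes the difference concrete. The sketch below is illustrative only: it uses plain entropy reduction over a small candidate set as a stand-in for a generic uncertainty-reduction reward (an assumption, not the curiosity reward's actual definition), and the distributions are made up. The point is that a move can make the agent more certain while moving it away from the answer.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def entropy_reduction(before, after):
    """Generic uncertainty-reduction signal: positive when beliefs sharpen."""
    return entropy(before) - entropy(after)

def delta_belief(before, after, target):
    """Log-ratio of the probability placed on the target candidate."""
    return math.log(after[target]) - math.log(before[target])

TARGET = 0  # index of the correct candidate (hypothetical)
prior = [0.25, 0.25, 0.25, 0.25]
confidently_wrong = [0.05, 0.85, 0.05, 0.05]  # sharper, but on the wrong answer
toward_target = [0.70, 0.10, 0.10, 0.10]      # sharper, on the right answer

wrong_clarity = entropy_reduction(prior, confidently_wrong)
wrong_belief = delta_belief(prior, confidently_wrong, TARGET)
right_clarity = entropy_reduction(prior, toward_target)
right_belief = delta_belief(prior, toward_target, TARGET)
```

Both moves reduce entropy, so a pure clarity reward pays for either; only the second raises the target's probability, so only it earns a positive ΔBelief value.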

The broader implication: in any setting where the model has access to the ground-truth final outcome, the model's own probability shift can serve as a dense intrinsic reward. The reward model is not load-bearing.


Paper: Intrinsic Credit Assignment for Long Horizon Interaction

Original note title

belief-shift toward the target solution is a dense intrinsic reward — log-ratio of sequential beliefs provides per-turn credit without separate critic or PRM