Intrinsic Credit Assignment for Long Horizon Interaction
How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method uses the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities and consistently outperforms RL with purely outcome-based rewards, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over long horizons by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards.
We propose ΔBelief-RL, a scalable framework that guides learning and credit assignment on intermediate actions in long-horizon tasks. At each interaction during training, we monitor the change in the agent's belief in the target. We use this information as a dense training signal, the ΔBelief reward, which reinforces actions that shift the agent's internal beliefs toward the target, while also using the final outcome as part of the reward. The benefit of our method is that it does not require a separately trained critic or process reward model; instead, it uses the agent's own intrinsic beliefs as a proxy for each action's value. This enables intermediate credit assignment "for free" by leveraging the agent's own progress toward the solution. Furthermore, our training strategy is general-purpose: it can be applied to any task where the correct final outcome is available during training.
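As a rough illustration of how a dense belief-based signal and the final outcome might be turned into per-turn rewards, consider the sketch below. The additive form, the function name `combine_rewards`, and the weight `belief_weight` are assumptions made for illustration only; the description above does not specify how the two terms are combined.

```python
from typing import List


def combine_rewards(
    belief_deltas: List[float],
    outcome_reward: float,
    belief_weight: float = 0.5,  # hypothetical weight, chosen for illustration
) -> List[float]:
    """Return one scalar reward per turn.

    Assumption: every turn receives the trajectory-level outcome reward plus a
    weighted copy of its own belief change. Other combinations are possible.
    """
    return [outcome_reward + belief_weight * delta for delta in belief_deltas]


# Three turns with hypothetical belief changes; the episode ended in success.
rewards = combine_rewards([0.7, -0.2, 1.5], outcome_reward=1.0)
```

Under this assumed scheme, the second turn, which moved belief away from the target, receives less reward than the other two even though all three share the same outcome.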
By tracking ΔBelief across the trajectory, we can quantify how each interaction resolves uncertainty, shifting the model's internal "world view" towards the correct solution. We calculate the per-turn belief change, denoted ΔBelief, as the log-ratio of consecutive beliefs, i.e., the log-probability assigned to the target after a turn minus the log-probability assigned to it before that turn. Working with log-probabilities ensures numerical stability and prevents floating-point underflow during training. This dense, turn-level signal reinforces actions that lead to the most informative updates to the agent's internal world view.
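The log-ratio described above can be sketched in a few lines. The function name `delta_beliefs`, the indexing convention (entry 0 is the belief before any interaction), and the example probabilities are assumptions for illustration, not values from the paper.

```python
import math
from typing import List


def delta_beliefs(log_beliefs: List[float]) -> List[float]:
    """Per-turn belief change as the difference of consecutive log-beliefs.

    log_beliefs[t] is the log-probability assigned to the target after turn t,
    with index 0 holding the belief before any interaction (an assumed
    convention). The result has one entry per turn.
    """
    return [curr - prev for prev, curr in zip(log_beliefs, log_beliefs[1:])]


# Hypothetical belief in the target: 1% -> 2% -> 2% -> 16%.
log_beliefs = [math.log(p) for p in (0.01, 0.02, 0.02, 0.16)]
deltas = delta_beliefs(log_beliefs)
```

Here the first turn doubles the belief (a change of about log 2), the second leaves it unchanged (zero), and the third multiplies it by eight (about log 8). Because the differences telescope, the per-turn values sum to the total log-ratio between the final and initial beliefs.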
Long-horizon RL suffers from sparse, trajectory-level rewards, making credit assignment difficult. A first line of work introduces dense supervision by training process reward models (PRMs) that rate intermediate steps. However, PRMs rely on expensive step-level supervision and can be exploited through reward over-optimization. A second line of work replaces learned dense rewards with verifiers, either by using an LLM-as-a-judge or by leveraging executable environment signals such as unit tests. Recent work further strengthens tool-based verification and turn-level credit assignment in multi-turn coding settings. All of these works focus on domains where the quality of actions can be checked automatically, such as mathematics and coding. In the absence of programmatic checks for success, recent work explores verifier-free RL by deriving intrinsic rewards from the model's own probabilities. Our work proposes measuring belief updates at the interaction level and using them as an intrinsic reward for RL.
In this work, we demonstrate that an agent's internal belief shifts can effectively guide learning in long-horizon tasks. By providing fine-grained credit assignment for intermediate actions, ΔBelief-RL significantly enhances learning efficiency. It does so in a computationally efficient manner: the contribution of individual actions is measured without a separate critic or reward model, requiring only the relatively inexpensive step of computing log-probabilities of the correct outcome. We train with ΔBelief-RL on the 20Qs task to teach models effective information-seeking; consequently, our CIA models at the 1.7B-4B scale outperform not only prior SoTA methods for multi-turn training but also much larger 670B models. Notably, the improved performance generalizes to extended interaction horizons and diverse out-of-distribution applications.
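To make the "log-probabilities of the correct outcome" step concrete, here is a minimal sketch that scores a target token sequence from per-position logits. It assumes the belief is the teacher-forced sequence log-probability of the target under the model; the helper names, the toy three-token vocabulary, and the logits are hypothetical and stand in for a real model's outputs.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def target_log_prob(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> float:
    """Sum the log-probabilities of the target tokens, one per position.

    step_logits[i] are the logits at the position where target_ids[i] is
    predicted (teacher forcing is assumed).
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, target_ids)
    )


# Toy example: vocabulary of 3 tokens, target sequence of 2 tokens.
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
log_belief = target_log_prob(step_logits, [0, 2])
```

Summing per-token log-probabilities, rather than multiplying probabilities, keeps the score representable even for long targets whose probability would otherwise underflow.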