On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM Agents
Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning, where agents must strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand this phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities limit information exploration during RL training. Insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve this issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments on 7 datasets show that our approach significantly mitigates information self-locking, yielding up to 60% improvement.
Reinforcement learning (RL) with outcome-based rewards has demonstrated great success in improving the reasoning capabilities of large language models (LLMs) (Wang et al., 2024; Srivastava & Aggarwal, 2025; Xu et al., 2025; Guo et al., 2025). Recently, it has received increasing attention for building LLM-based agents, where the agent must interact with the environment and resolve tasks beyond single-turn interactions (Zhang et al., 2025a; Plaat et al., 2025). These tasks often lack clear specifications, e.g., underspecified user queries, so the agent needs to strategically ask questions to acquire the missing information, i.e., perform multi-turn active reasoning (Zhou et al., 2025; Wu et al., 2025; Laban et al., 2025; Li et al., 2025).
Despite this success, we find that training an LLM agent with outcome-based RL suffers from information self-locking (SeL). Under SeL, agents get stuck in low-information interaction patterns: the agent ceases to ask informative questions and struggles to internalize already-obtained information. This aligns with known failure modes of agents in real-world use (Wang et al., 2025b). To better understand these failure modes, we propose to decompose agentic behaviors in active reasoning into action selection (AS), which determines what information is queried, and belief tracking (BT), which governs how acquired evidence is internalized and affects the final outcome. Across two multi-turn active reasoning benchmarks in Sec. 2, we show that neither capability improves effectively even as task rewards increase. This raises a challenging research question:
Why does SeL happen and how to mitigate it?
Belief tracking is essential to the success of active reasoning. Essentially, the agent needs to model its belief $b^M_t \in \Delta(\mathcal{S})$ about the progress of problem solving and what information remains missing throughout turns $t \in \{0, \dots, H\}$. The behavior of an agent with parameters $\omega$ in active reasoning can be decomposed into two coupled processes. Action Selection (AS): the agent selects an action (e.g., a question) according to a belief-conditioned policy $a_t \sim \pi^Q_\omega(\cdot \mid b^M_t)$, aiming to elicit informative observations $o_t \sim \mathcal{O}(\cdot \mid s^\star, a_t)$ from the environment. Belief Tracking (BT): after receiving an observation $o_t \in \mathcal{O}$, the agent updates its belief via an internal update operator $b^M_{t+1} = \pi^U_\omega(b^M_t, a_t, o_t)$, integrating information accumulated over previous interaction rounds.
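To make the decomposition concrete, the sketch below walks a toy agent through one AS/BT loop on a number-guessing environment. All names here (ToyEnv, ToyAgent, respond) are illustrative assumptions for exposition, not the paper's implementation or benchmarks.

```python
import random

class ToyEnv:
    """Toy environment: the hidden state s* is an integer in [lo, hi]."""
    def __init__(self, lo=0, hi=15, seed=0):
        self.s_star = random.Random(seed).randint(lo, hi)

    def respond(self, midpoint):
        # Observation o_t ~ O(. | s*, a_t): answers "is s* <= midpoint?"
        return self.s_star <= midpoint

class ToyAgent:
    def __init__(self, lo=0, hi=15):
        self.belief = (lo, hi)  # b_t: interval of states consistent with evidence

    def select_action(self):
        # Action Selection (AS): a_t ~ pi^Q(. | b_t); here, deterministically
        # pick the most informative query by bisecting the belief interval.
        lo, hi = self.belief
        return (lo + hi) // 2

    def update_belief(self, a_t, o_t):
        # Belief Tracking (BT): b_{t+1} = pi^U(b_t, a_t, o_t); shrink the
        # interval to the half consistent with the observation.
        lo, hi = self.belief
        self.belief = (lo, a_t) if o_t else (a_t + 1, hi)

env, agent = ToyEnv(), ToyAgent()
for t in range(4):                   # horizon H = 4 suffices for 16 states
    a_t = agent.select_action()      # AS determines the observation stream
    o_t = env.respond(a_t)           # evidence elicited from the environment
    agent.update_belief(a_t, o_t)    # BT internalizes the evidence
print(agent.belief)                  # collapses to (s*, s*)
```

Note how the two processes are coupled even in this toy setting: AS fixes which observations can ever be seen, while BT determines whether those observations actually shrink the belief; degrading either one degrades the usefulness of the other.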
2.3. Failure modes in reinforcement learning training

Despite the success of RL with outcome-based rewards, we find, interestingly, that LLM agents exhibit several failure modes across both active-reasoning testbeds during training. Observation 1: Reward improvements do not translate into increased information acquisition. Fig. 2a (PE-G) and Fig. 2b (MediQ) report the training dynamics of episode reward, per-turn AS, and per-turn BT. Across both datasets, we observe a pronounced decoupling: while the reward improves over training, BT exhibits only limited gains and AS fails to improve, often plateauing or even degrading. This observation raises questions about how AS and BT confound each other, as AS cannot improve even when BT does.
To isolate the effect, we analyze the relationship between AS and reward under different BT capabilities. Specifically, we fix identical action sequences and compare outcomes when the observation stream is processed by (i) the agent's internal BT versus (ii) a stronger belief-update mechanism, e.g., human-defined update rules or a frontier reasoning model. Since the environment dynamics and action sequences are identical across conditions, any difference in outcome can be attributed solely to the belief-update mechanism.
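A minimal sketch of this replay diagnostic follows, assuming episodes are logged as (action, observation) traces with gold labels; replay, pearson, and the belief-update/answer callables are hypothetical names for exposition, not the paper's code.

```python
import statistics

def replay(episodes, belief_update, answer):
    """Replay fixed (action, observation) traces through a given BT operator."""
    rewards = []
    for ep in episodes:
        belief = []                               # start from an empty belief
        for a_t, o_t in ep["trace"]:              # actions/observations held fixed
            belief = belief_update(belief, a_t, o_t)
        rewards.append(1.0 if answer(belief) == ep["label"] else 0.0)
    return rewards

def pearson(xs, ys):
    """Pearson correlation between per-episode AS scores and rewards."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Usage sketch: as_scores[i] measures how informative episode i's actions were.
#   r_weak   = pearson(as_scores, replay(episodes, agent_bt,  agent_answer))
#   r_strong = pearson(as_scores, replay(episodes, oracle_bt, oracle_answer))
# A gap r_strong >> r_weak indicates that weak BT masks informative AS.
```

Because the action and observation streams are held fixed across the two conditions, the gap between the correlations isolates how much the agent's own BT attenuates the reward signal attributable to AS.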
Observation 2: Weak belief tracking masks the contribution of informative actions. As shown in Figs. 2c and 2d, the correlation between AS and reward is substantially higher under strong BT, but remains weak under the agent's own BT. This indicates that the contribution of AS to the reward is masked when belief updates are unreliable: even high-information actions yield little reward improvement if the information they elicit is not incorporated into the internal belief. As a result, policy optimization cannot provide stable learning signals to reinforce informative AS choices.
Observation 3: Conservative action selection limits belief refinement and induces interaction-insensitive shortcuts. Complementary to Obs. 2, we now examine the reverse direction of the coupling. When AS becomes conservative and yields little informative evidence, BT is deprived of meaningful signals to learn from. Under outcome-only supervision, this even incentivizes shortcut behaviors that reduce reliance on interaction, reinforcing a low-information training regime. We observe that as training progresses, agents become less sensitive to informative observations and rely increasingly on early-stage context. In MediQ, we intervene by replacing all patient feedback with "Unknown" while keeping all other configurations unchanged. Notably, the induced performance drop becomes smaller after RL training (41.25→30.50 w/o RL versus 61.00→55.50 with RL; see Fig. 5a), suggesting that interaction-derived evidence has a weaker causal effect on the final decision. Crucially, this reduced sensitivity is accompanied by an increase in belief consistency (Fig. 5a; 78.7 w/o RL versus 92.8 with RL): the agent increasingly adheres to its initial judgment instead of revising its beliefs in response to interaction, reflecting a more "stubborn" belief-update pattern. Together, these constitute interaction under-utilization: once conservative AS restricts information exposure and weak BT struggles to internalize evidence, RL pressure favors non-interactive heuristics that stabilize outcomes while further suppressing exploration and evidence usage.
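The two interventions above admit simple implementations. The sketch below, with hypothetical helpers (ablate_feedback, accuracy_of) and the assumption that per-episode traces and turn-0/final answers are logged, illustrates the feedback ablation and the belief-consistency measure; it is not the paper's evaluation code.

```python
def ablate_feedback(trace, placeholder="Unknown"):
    # Replace every environment observation with an uninformative placeholder
    # while keeping the agent's questions and all other settings unchanged.
    return [(a_t, placeholder) for a_t, _ in trace]

def interaction_sensitivity(accuracy_of, episodes):
    # Accuracy drop when evidence is removed; a small drop suggests the final
    # decision depends only weakly on interaction-derived evidence.
    ablated = [{**ep, "trace": ablate_feedback(ep["trace"])} for ep in episodes]
    return accuracy_of(episodes) - accuracy_of(ablated)

def belief_consistency(initial_answers, final_answers):
    # Fraction of episodes whose final answer equals the turn-0 answer;
    # higher values indicate a more "stubborn" belief-update pattern.
    agree = sum(a == b for a, b in zip(initial_answers, final_answers))
    return agree / len(initial_answers)
```

Under this framing, a shrinking interaction_sensitivity together with a rising belief_consistency over RL training is exactly the signature of the shortcut behavior described above.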
Information Self-Locking of RL training for active reasoning. Taken together, these observations indicate that SeL emerges from a bidirectional coupling between AS and BT. The reward-relevant value of AS is mediated by the agent's ability to absorb information through BT, while BT is in turn constrained by the information budget induced by AS. This mutual dependence can trap training dynamics in a low-information regime, giving rise to self-locking behavior.