Why do RL agents stop asking informative questions?
RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training dynamics problem could reveal how to unlock better information-seeking behavior.
RL-trained LLM agents develop a pathological pattern called "information self-locking": they stop asking informative questions and fail to integrate information they have already obtained. The mechanism decomposes into two interdependent components — Action Selection (AS: what to query next) and Belief Tracking (BT: how to update internal beliefs from observations) — that form a bidirectional feedback loop trapping training in a low-information regime.
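A minimal sketch of this decomposition as an agent loop (names like `select_action`, `update_belief`, and `env.respond` are illustrative, not from the paper) makes the coupling visible in the control flow: AS reads the belief state that BT writes, and BT only sees observations that AS's queries elicit.

```python
from typing import Protocol

class InfoSeekingAgent(Protocol):
    def select_action(self, belief: dict) -> str:
        """AS: choose the next query given the current belief state."""
        ...

    def update_belief(self, belief: dict, observation: str) -> dict:
        """BT: integrate an observed answer into the belief state."""
        ...

def episode(agent: InfoSeekingAgent, env, turns: int) -> dict:
    belief: dict = {}
    for _ in range(turns):
        query = agent.select_action(belief)           # AS conditioned on BT's output
        answer = env.respond(query)                   # informativeness depends on AS
        belief = agent.update_belief(belief, answer)  # BT fed only what AS elicited
    return belief
```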
The trap works as follows: weak Belief Tracking means the agent cannot internalize the value of informative answers it receives, so the gradient signal for choosing informative actions is attenuated. Simultaneously, conservative Action Selection means the agent never generates the diverse queries that would exercise and improve its belief-tracking capability. Each deficiency masks the other's potential contribution, creating a stable but suboptimal equilibrium.
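To make the stability of that equilibrium concrete, here is a toy bistable model, a sketch of the dynamic described above rather than the paper's formal model. Each capability's learning signal is gated by the other and by how often informative actions are actually tried, while a small decay term stands in for drift toward conservative behavior; all constants are assumptions chosen for illustration.

```python
def simulate(as_0: float, bt_0: float,
             lr: float = 0.5, decay: float = 0.05, steps: int = 500):
    """Evolve coupled Action Selection (a) and Belief Tracking (b) levels in [0, 1]."""
    a, b = as_0, bt_0
    for _ in range(steps):
        # Each gradient is attenuated by the *other* capability and by how
        # often informative actions are tried; decay models conservative drift.
        da = lr * b * a * (1.0 - a) - decay * a
        db = lr * a * b * (1.0 - b) - decay * b
        a = min(max(a + da, 0.0), 1.0)
        b = min(max(b + db, 0.0), 1.0)
    return round(a, 3), round(b, 3)

print(simulate(0.05, 0.05))  # locked: collapses to (0.0, 0.0)
print(simulate(0.30, 0.30))  # escapes: converges near (0.89, 0.89)
```

In this parameterization the symmetric dynamics have two stable fixed points: below roughly 0.11, decay dominates the mutually gated learning signal and both capabilities collapse together; above it, they co-improve toward the high fixed point.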
Directional critiques (targeted feedback that addresses AS and BT independently) help agents escape the trap, producing up to a 60% improvement. This suggests the lock is not a fundamental capability limitation but an artifact of training dynamics.
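Continuing the toy model above, a directional critique can be modeled as a small learning signal that targets each capability directly, ungated by the other. The critique strength below (0.03) is an assumption for illustration, not a value from the paper.

```python
def simulate_with_critique(as_0: float, bt_0: float, critique: float = 0.03,
                           lr: float = 0.5, decay: float = 0.05, steps: int = 500):
    a, b = as_0, bt_0
    for _ in range(steps):
        # The critique term pushes each capability up directly, even when
        # the other capability is too weak to supply any gated signal.
        da = lr * b * a * (1.0 - a) - decay * a + critique * (1.0 - a)
        db = lr * a * b * (1.0 - b) - decay * b + critique * (1.0 - b)
        a = min(max(a + da, 0.0), 1.0)
        b = min(max(b + db, 0.0), 1.0)
    return round(a, 3), round(b, 3)

print(simulate_with_critique(0.05, 0.05))  # same start as the locked case, now escapes
```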
This finding deepens the connection between RL training dynamics and "Why can't advanced AI models take initiative in conversation?". Passivity was previously understood as a training-objective problem: next-turn reward optimization discourages initiative. Information self-locking reveals that even agents trained with RL to seek information can become trapped in uninformative patterns. The problem is not just that agents are not trained to be proactive; RL training dynamics actively lock them into passivity through the AS-BT feedback loop.
The decomposition also connects to "Why do reasoning LLMs fail at deeper problem solving?". Wandering exploration is an AS failure (choosing where to go next), while information self-locking adds a BT failure (not learning from where you have been). Together they describe an agent that neither explores systematically nor learns from its exploration: a double deficit.
The directional-critique finding, that targeted feedback on specific sub-capabilities breaks the lock, resonates with "Can natural language feedback overcome numerical reward plateaus?". Scalar rewards cannot decompose the AS-BT problem; structured critique can.
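In the same toy model, a scalar bonus that rewards only the joint outcome (here proportional to a * b, so it is again product-gated) illustrates why undifferentiated reward fails: the extra signal vanishes exactly where both capabilities are weak. Again a sketch under assumed constants, not the paper's experiment.

```python
def simulate_with_scalar_bonus(as_0: float, bt_0: float, bonus: float = 0.03,
                               lr: float = 0.5, decay: float = 0.05, steps: int = 500):
    a, b = as_0, bt_0
    for _ in range(steps):
        shared = bonus * a * b  # scalar reward: one number for the joint outcome
        da = lr * b * a * (1.0 - a) - decay * a + shared * (1.0 - a)
        db = lr * a * b * (1.0 - b) - decay * b + shared * (1.0 - b)
        a = min(max(a + da, 0.0), 1.0)
        b = min(max(b + db, 0.0), 1.0)
    return round(a, 3), round(b, 3)

print(simulate_with_scalar_bonus(0.05, 0.05))  # lock persists: collapses to (0.0, 0.0)
```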
Source: Reasoning Architectures Paper: "Information Self-Locking in RL" (2603.12109)
Original note title:
rl-trained agents exhibit information self-locking — weak belief tracking and conservative action selection create a bidirectional trap in low-information regimes