Look Before You Leap: Autonomous Exploration for LLM Agents

Paper · arXiv 2605.16143 · Published May 15, 2026
RL with Verifiable Rewards (RLVR)

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
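To make the coverage idea concrete, the following is a minimal sketch of how a checkpoint-coverage score could be computed. It assumes, purely for illustration, that coverage is the fraction of a predefined checkpoint set that the agent discovers during a rollout; the text above does not give the exact formula. The function name and the checkpoint identifiers are hypothetical.

```python
from typing import Iterable, Set


def exploration_checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Return the fraction of predefined checkpoints found among the visited items.

    Assumption: coverage is a plain ratio over a fixed checkpoint set. Repeated
    visits to the same checkpoint do not increase the score.
    """
    if not checkpoints:
        raise ValueError("checkpoints must be non-empty")
    discovered = checkpoints.intersection(visited)
    return len(discovered) / len(checkpoints)


# Hypothetical environment: these checkpoint names are invented for illustration.
checkpoints = {"state:kitchen", "object:key", "affordance:open_drawer", "state:hallway"}

# A narrow, repetitive rollout revisits the same state and finds few checkpoints.
narrow_rollout = ["state:kitchen", "state:kitchen", "object:key", "state:kitchen"]
# A broader rollout touches every checkpoint once.
broad_rollout = ["state:kitchen", "object:key", "affordance:open_drawer", "state:hallway"]

narrow_score = exploration_checkpoint_coverage(narrow_rollout, checkpoints)
broad_score = exploration_checkpoint_coverage(broad_rollout, checkpoints)
print(narrow_score, broad_score)
```

Under this assumed definition, repetitive behavior is penalized implicitly: revisiting a known state adds interaction steps without raising the score.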

Introduction. Agents based on large language models (LLMs) have found remarkable application in realistic scenarios involving multi-step interactions with complex and diverse environments [1, 2, 3, 4, 5]. With the advancement of Reinforcement Learning with Verifiable Rewards (RLVR), models have made substantial progress in interacting with complex environments to solve multi-step tasks [6, 7, 8]. Despite this progress, a key aspect remains underexplored: current RLVR approaches primarily optimize for task-completion rewards in known or static distributions, thereby encouraging instrumental behaviors aimed at solving predefined tasks. As a result, they provide limited incentive for developing the autonomous exploration capabilities required to adapt to novel, unfamiliar environments. Lacking an intrinsic exploratory capability, current LLM-based agents often exhibit a pattern of premature exploitation.

Discussion / Conclusion. We identify autonomous environment exploration as a missing but essential capability for LLM agents: models optimized primarily for task completion often exhibit premature exploitation. To study this capability systematically, we formalize exploration as an independent and trainable objective, and introduce Exploration Checkpoint Coverage (ECC) as a verifiable metric for quantifying the extent to which agents discover critical states, objects, and affordances within an environment. We further show that exploration can be explicitly instilled through interleaved GRPO with ECC-based rewards, enabling agents to first build grounded environment knowledge and then use it for more robust downstream task execution under the Explore-then-Act paradigm.
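The interleaving described above can be sketched as follows. This is an illustrative outline, not the authors' training code: the strict alternation schedule, the function names, and the use of population standard deviation are all assumptions. The only part taken from standard GRPO is the group-relative advantage, which normalizes each rollout's reward against the other rollouts in its group.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own rollout group (GRPO-style advantage).

    Assumption: population standard deviation plus a small epsilon for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rollout_kind(step: int, explore_every: int = 2) -> str:
    """Pick the rollout type for a training step.

    Assumption: every `explore_every`-th step is an exploration step; the
    actual mixing ratio is not stated in the text.
    """
    return "explore" if step % explore_every == 0 else "task"


schedule = [rollout_kind(step) for step in range(4)]

# Hypothetical reward groups. Each rollout type is scored only by its own
# verifiable reward, and advantages are computed within that group.
task_rewards = [1.0, 0.0, 0.0, 1.0]        # binary task-completion reward
explore_rewards = [0.25, 0.5, 0.75, 0.5]   # coverage-style reward in [0, 1]

task_advantages = group_relative_advantages(task_rewards)
explore_advantages = group_relative_advantages(explore_rewards)
print(schedule)
```

Keeping the two reward groups separate means an exploration rollout is never compared against a task rollout, so neither reward scale dominates the other.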