Agent Learning via Early Experience

Paper · arXiv 2510.08558 · Published October 9, 2025
Data · Training · Fine-Tuning · LLM Architecture · Reinforcement Learning

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent’s own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies for using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
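
One way to write down the early-experience data described above (our notation, not necessarily the paper’s formal setup; $T$ denotes the environment’s transition function, assumed observable):

$$
\mathcal{D}_{\text{early}} = \left\{ \big(s_i,\, a_i^{j},\, s_i^{j}\big) \;\middle|\; a_i^{j} \sim \pi_\theta(\cdot \mid s_i),\ \ s_i^{j} = T\big(s_i, a_i^{j}\big) \right\}
$$

where $s_i$ ranges over states the agent encounters and each $a_i^{j}$ is an action the agent itself proposes. Supervision comes entirely from the observed future states $s_i^{j}$; no reward term appears anywhere in the construction.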

To build such language agents, one promising solution is reinforcement learning (RL), where agents are trained by optimizing the expected cumulative reward returned by the environment. This paradigm has enabled traditional agents such as AlphaGo (Silver et al., 2016) to achieve superhuman performance in domains with well-defined environments and reward structures, such as Atari games (Bellemare et al., 2013) and the game of Go, echoing the vision of an emerging era of experience (Silver and Sutton, 2025) for language agents. However, applying RL to real-world language agents remains highly challenging. Many environments of interest lack verifiable or dense reward signals, especially in open-ended settings such as websites, where platforms do not expose ground-truth feedback. For example, a form may appear to be submitted successfully, yet the agent receives no indication of whether each field was filled out correctly. In addition, tasks in multi-turn tool-use environments often involve long interaction sequences (Xie et al., 2024a; Jin et al., 2025) with delayed or ambiguous outcomes, making credit assignment and training inefficient and unstable.
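
Concretely, the objective RL optimizes is the standard expected discounted return (generic notation, not specific to this paper):

$$
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory, $\gamma \in (0, 1]$ a discount factor, and $r$ the environment’s reward. The difficulties above map directly onto this objective: on websites $r$ is unverifiable, and in long-horizon tool use it is sparse and delayed, making the expectation hard to estimate and credit assignment unstable.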

As a workaround, most current language agents are instead trained on expert-curated data with supervised fine-tuning (SFT; Deng et al. (2023); Pahuja et al. (2025); Prabhakar et al. (2025)). This paradigm bypasses the need for reward signals by learning from human demonstrations, with agents mapping states to actions over static datasets. While SFT is straightforward and efficient, it has inherent limitations. An agent trained under this paradigm never interacts with the environment during training and never observes the outcomes of its own actions, which restricts its ability to learn from failure, refine its decision-making, or generalize to unseen situations (Chu et al., 2025). Furthermore, this approach assumes the demonstrations are expert or near-optimal, yet scaling high-quality human demonstrations is expensive and difficult to sustain. More critically, it locks the agent into a passive role, bound by the imagination and coverage of its training data rather than actively learning from its own experience. Given these limitations, and given that reliable reward signals are often unavailable, how can we train agents to grow from their own experience without any external reward signals?
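
The SFT paradigm described here is the standard behavior-cloning objective over a static expert dataset (again in generic notation):

$$
\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(s,\, a^{*}) \sim \mathcal{D}_{\text{expert}}}\left[ \log \pi_\theta(a^{*} \mid s) \right]
$$

Nothing in this loss depends on the consequences of the agent’s own actions, which is precisely the gap the early experience paradigm introduced next is designed to close.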

Motivated by these limitations, we introduce the early experience paradigm, a middle ground between imitation learning and reinforcement learning, as shown in Figure 1. In this setting, agents learn not only from human-curated data but also from the future states produced by their own proposed actions in the environment. These future states are the agent’s own experience, and they can be transformed into supervision signals that let it grow directly from the consequences of its actions, without relying on external reward signals. We explore two strategies for transforming these future states into supervision: (1) Implicit World Modeling: using the collected future states to help the agent build internal representations of environment dynamics, allowing it to better understand the environment by predicting future states. (2) Self-Reflection: guiding the agent to compare its behavior with expert demonstrations, identify suboptimal decisions, and extract lessons that improve future decision-making. Both strategies share the same principle: in the absence of external rewards, the agent’s own actions and the resulting future states still constitute experience that serves as a direct source of supervision. By turning the future states generated by its own actions into learning signals, the language agent can continually improve without relying on additional human data or external rewards.
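
A minimal sketch of how these two strategies might turn collected (state, action, next-state) triples into ordinary fine-tuning examples. All names below (policy.propose_actions, env.step_from, llm.generate, and so on) are illustrative placeholders, not the paper’s actual API, and the real prompt formats will differ:

```python
# Sketch only: assumes duck-typed `policy`, `env`, and `llm` objects exposing
# the methods used below; none of these names come from the paper's codebase.

def collect_early_experience(policy, env, expert_states, n_alternatives=3):
    """Roll out the agent's own candidate actions from expert-visited states
    and record the resulting future states. No reward signal is collected."""
    triples = []
    for state in expert_states:
        for action in policy.propose_actions(state, n=n_alternatives):
            next_state = env.step_from(state, action)  # observed consequence
            triples.append((state, action, next_state))
    return triples

def build_world_modeling_examples(triples):
    """Implicit world modeling: supervise the agent to predict the future
    state that follows a given state-action pair, grounding it in dynamics."""
    return [
        {"input": f"State: {s}\nAction: {a}\nPredict the resulting state.",
         "target": str(s_next)}
        for (s, a, s_next) in triples
    ]

def build_self_reflection_examples(triples, expert_action_for, llm):
    """Self-reflection: contrast the agent's alternative action (and its
    observed outcome) with the expert's choice, then train on the generated
    lesson followed by the expert action."""
    examples = []
    for (s, a, s_next) in triples:
        a_expert = expert_action_for[s]  # mapping: state -> expert action
        if a == a_expert:
            continue  # only reflect on actions that deviate from the expert
        lesson = llm.generate(
            f"State: {s}\nTried action: {a}\nObserved outcome: {s_next}\n"
            f"Expert action: {a_expert}\n"
            "Explain why the expert action is preferable here."
        )
        examples.append(
            {"input": f"State: {s}\nThink, then act.",
             "target": f"{lesson}\nAction: {a_expert}"}
        )
    return examples
```

Both builders emit plain input-target pairs, so the resulting data can be mixed with expert demonstrations in an unchanged supervised fine-tuning pipeline.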

We comprehensively evaluate early experience across eight diverse environments, spanning embodied navigation, web navigation, multi-turn tool use, long-horizon planning, and multi-domain API tasks, using multiple model architectures. Across all settings, both methods consistently outperform pure imitation-learning baselines. Moreover, in environments where verifiable rewards are available, initializing RL from checkpoints trained with early experience leads to substantially stronger performance than standard imitation-learning warm starts. This shows that gains from the early-experience stage carry over to the final model’s post-RL performance. Beyond these empirical gains, our analysis shows that early experience enables capabilities unattainable through imitation learning alone. It scales effectively, achieving comparable or superior performance with only half, or even less, of the expert data. The paradigm also applies seamlessly to larger models, preserving its effectiveness across scales. These results show that early experience is not merely an alternative to imitation learning but a practical and scalable bridge to reinforcement learning, delivering both immediate gains in effectiveness and long-term benefits for era-of-experience training regimes.

Our contributions are summarized as follows: (1) We advocate and formalize the early experience paradigm as a practical and scalable bridge between imitation learning and reinforcement learning for building autonomous language agents. It empowers agents to convert their own experience into learning signals without relying on external rewards and can be seamlessly integrated into existing training pipelines. (2) We propose and systematically study two training strategies under this paradigm: implicit world modeling, which enhances decision-making by modeling environment dynamics directly from collected experience, and self-reflection, which distills fine-grained lessons from the agent’s own actions. (3) We conduct a comprehensive evaluation across eight diverse environments and multiple model families. Our methods consistently improve task effectiveness, out-of-domain generalization, and downstream reinforcement learning performance, achieving state-of-the-art results on several benchmarks and offering actionable insights through detailed analysis.