Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
Abstract: Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive and agentic environments. Existing work primarily focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to an inability to recover from errors. However, step-level critique data is notoriously difficult and expensive to collect. Automatically and dynamically constructing self-critique datasets is therefore crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agents to Reflect on the fly. Unlike traditional methods that reward or penalize actions solely based on correctness, our approach leverages Monte Carlo Tree Search (MCTS) to construct training samples that recover correct trajectories from erroneous ones. A key challenge of reflection in agent tasks is the need for timely revision rather than waiting until the end of a rollout to correct errors. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from this step, we splice the erroneous prefix with an adjacent correct path that shares the same parent node in the tree. This strategy enables the model to learn reflection grounded in its current policy, thereby yielding better learning efficiency.
A critical bottleneck in enhancing error recovery in interactive and agentic environments is the lack of step-level reflection data. Traditional approaches to collecting such datasets rely on labor-intensive annotation, which is both time-consuming and costly (Lin et al., 2024; Zeng et al., 2024b; Zheng et al., 2024). Without robust reflection data, models struggle to identify and correct their own errors, limiting their utility as intelligent agents. Constructing reflection datasets is thus essential for building agents capable of self-reflection and better decision-making. However, automatically constructing such training samples is non-trivial. A significant challenge of reflection in agent tasks is the need for timely revision rather than waiting until the end of a rollout to correct errors. If corrections are applied only at the end of the trajectory, the delayed revisions prevent agents from learning to detect and address errors as they occur, undermining their capacity for real-time self-reflection. Furthermore, delayed revisions may leave catastrophic errors unaddressed, particularly those occurring early in the trajectory.
To address these challenges, we propose Agent-R, a novel framework designed to enable LLM-based agents to perform on-the-fly reflection and self-improvement. Unlike previous reward-based approaches, which penalize or reward actions based solely on outcome correctness (Song et al., 2024b; Xiong et al., 2024; Shi et al., 2024; Putta et al., 2024), Agent-R introduces a dynamic self-training framework that revises errors at the step level. By leveraging Monte Carlo Tree Search (MCTS) (Kocsis and Szepesvári, 2006), Agent-R dynamically constructs training samples that recover correct trajectories from erroneous ones, effectively guiding the agent through complex decision spaces. Specifically, Agent-R identifies the most suitable revision step (as judged by the current actor model) in an incorrect trajectory and splices it with the subsequent correct trajectory, enabling recovery at the point of error rather than only after the rollout terminates. This dynamic revision process not only enhances the agent’s reflection ability but also mitigates the risk of concatenating inconsistent or incoherent trajectories, which can occur with naive correction strategies.
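To make the trajectory-splicing step concrete, the following is a minimal Python sketch of how such a revision training sample could be assembled. All names (`Step`, `build_revision_sample`, `REVISION_SIGNAL`, `first_error_idx`) are illustrative rather than taken from the Agent-R implementation; the sketch simply assumes that the erroneous and correct rollouts branch from a shared parent node in the MCTS tree and that the actor model has already flagged the first erroneous step.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """One action/observation pair in an agent rollout."""
    action: str
    observation: str = ""

# Hypothetical revision signal; the actual prompt text is not taken from the paper.
REVISION_SIGNAL = Step(action="Reflection: my earlier action was wrong; I will revise my approach.")

def build_revision_sample(bad_traj: List[Step],
                          good_traj: List[Step],
                          first_error_idx: int) -> List[Step]:
    """Splice an erroneous rollout with a correct sibling branch.

    Assumes `bad_traj` and `good_traj` branch from the same parent node
    in the search tree (i.e., they share a common action prefix), and
    that `first_error_idx` is the earliest step of `bad_traj` that the
    current actor model flags as erroneous.
    """
    # Locate the shared parent node: length of the common action prefix.
    branch_idx = 0
    for a, b in zip(bad_traj, good_traj):
        if a.action != b.action:
            break
        branch_idx += 1

    # Truncate the bad rollout right after its first detected error so the
    # revision happens as early as possible, not at the end of the rollout.
    truncated_bad = bad_traj[: first_error_idx + 1]

    # Revision sample = erroneous prefix + revision signal + correct
    # continuation taken from the sibling branch below the shared parent node.
    return truncated_bad + [REVISION_SIGNAL] + good_traj[branch_idx:]
```

Under these assumptions, the resulting sample exposes the model to its own mistake, an explicit revision signal at the earliest detected error, and the correct continuation from the shared parent, rather than a correction appended only after the full rollout.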