Tree Search for LLM Agent Reinforcement Learning
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels.
The agentic RL manifests in two key challenges. First, heavy budget taken in LLM rollouts: agent settings require LLMs to interact with the environment over multi-turns and complete tasks through sequential decision-making, which consequently leads to agent trajectories with thousands or more tokens alongside multiple tool-calls. Existing group-based RL methods sample multiple independent trajectories for each task in a chain-based rollout scheme, with considerable redundancy in the sampling process. Second, sparse supervision in long-horizon, multi-turn trajectories: although agent trajectories grow with the number of turns, current agent RL approaches are still primarily driven by outcome rewards. Such trajectory-level sparse signals make it difficult to identify which specific steps or actions in a multi-turn, interdependent sequence contributed to success or failure.
Furthermore, to address the challenge of sparse supervision, we construct more fine-grained process supervision signals by estimating relative advantages based on the tree structure. Specifically, at every branching point of the tree, we back-propagate outcome rewards from the respective subtree leaves. The differences across sibling branches serve as a preference-learning objective, providing process-level supervision signals between subtrees, where the subtree depth determines the granularity of the process signal. Since our tree search strategy uses a random expansion, it inherently yields process signals of varying granularity, enabling the model to learn intermediate decision making. This meticulous design leverages the tree structure to transform trajectory-level signals into process-level supervision. Its reliance solely on outcome rewards without additional supervision highlights its scalability and plug-and-play nature.
In this work, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), adopting a tree-search rollout strategy in place of independent chain-based rollouts for LLM agent RL. Based on agent step-level nodes, Tree-GRPO carries out rollout sampling over a semantically well-defined search tree. By sharing common prefixes, the tree search sampling significantly reduces the rollout budget in terms of both tokens and tool calls during training. Tree-GRPO leverages the tree structure to conduct tree-based grouping for advantage estimation, introducing an implicit step-level preference-learning objective. Empirical evaluations on 11 datasets demonstrate the superiority of our tree-based approach for agentic RL.
Across models and conditions, two dominant deficits emerge: fallible representations of the board state, leading to invalid moves, and weak heuristic planning, leading to loops or moves that do not advance the puzzle toward the goal state. This work moves beyond aggregate performance metrics to offer a granular, qualitative analysis of why LLMs struggle with such tasks and how different interventions modulate these failures.