Tree Search for Language Model Agents
Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary to most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search.
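To make the idea concrete, the following is a minimal sketch of best-first tree search over environment states, not the paper's implementation. A toy number-line environment (goal state 3, actions +1/-1) stands in for a web environment, and `propose_actions` and `value` stand in for the LM action proposer and LM value function; the search always expands the highest-scoring frontier node, which implicitly backtracks when a branch looks unpromising.

```python
import heapq
import itertools

def propose_actions(state, k=2):
    """Stand-in for an LM proposing candidate actions for a state."""
    return [+1, -1][:k]

def value(state, goal=3):
    """Stand-in for an LM value function scoring a state (0 is best)."""
    return -abs(goal - state)

def best_first_search(start, max_expansions=20, branching=2):
    """Best-first tree search: expand the highest-value frontier node."""
    counter = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(-value(start), next(counter), start, [])]
    best_traj, best_score = [], value(start)
    for _ in range(max_expansions):
        if not frontier:
            break
        neg, _, state, traj = heapq.heappop(frontier)
        if -neg >= best_score:
            best_score, best_traj = -neg, traj
        if -neg == 0:  # goal state reached
            return traj, -neg
        for action in propose_actions(state, k=branching):
            nxt = state + action  # "execute" the action in the toy environment
            heapq.heappush(
                frontier, (-value(nxt), next(counter), nxt, traj + [action])
            )
    return best_traj, best_score

traj, score = best_first_search(0)
print(traj, score)  # [1, 1, 1] 0
```

In a real web environment the key practical difference is that "expanding" a node requires restoring the browser to that node's state before executing each candidate action, which is what makes searching in the actual environment space costly but faithful.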
One significant bottleneck in existing agents arises from their inability to leverage test-time computation for exploration and multi-step planning. Search and planning are especially important in open-ended web environments, as the potential action space (i.e., all possible actions one can take on a webpage) is much larger than in most video games or text-based simulators. There are often multiple plausible actions that must be sequenced to reach a goal, and being able to efficiently explore and prune trajectories is essential.
In deep learning, search algorithms with neural network components have been instrumental in achieving superhuman performance on many games: Monte-Carlo Tree Search (MCTS) (Browne et al., 2012) was used to provide lookahead search, and was pivotal in the AlphaGo (Silver et al., 2016) and AlphaGo Zero (Silver et al., 2017) systems that achieved superhuman performance in the game of Go. Gray et al. (2020) perform one-step lookahead search to achieve state-of-the-art performance on no-press Diplomacy. More recently, several papers (Yao et al., 2024; Besta et al., 2024) showed the potential of applying search to large language models to introduce exploration over multiple reasoning paths, enhancing performance on text-based tasks that require non-trivial planning.
Most approaches treat this as a partially observable Markov decision process, and only condition on the current observation o_t when predicting the next action a_t to take. This has significant limitations: the error of the agent compounds with each step, and if an erroneous action is taken at time t, it cannot be easily rectified if this leads to a bad state in the future. Our approach aims to alleviate this by explicitly conducting search and backtracking to identify better trajectories.
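The compounding-error argument can be made concrete with a back-of-the-envelope calculation (an illustration, not a result from the paper): if a greedy agent that never backtracks selects the correct action with probability p at each step, its success rate on a T-step task decays geometrically.

```python
# Illustration (assumed model, not from the paper): a greedy agent with
# per-step accuracy p and no backtracking succeeds on a T-step task only
# if every one of the T actions is correct.
def greedy_success_rate(p: float, T: int) -> float:
    return p ** T

print(round(greedy_success_rate(0.9, 10), 3))  # 0.349
```

Even a fairly accurate per-step policy (p = 0.9) succeeds on only about a third of 10-step tasks, which is why the ability to detect a bad state and backtrack to an earlier one matters so much in long-horizon web tasks.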