TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
To measure the progress of LLM agents on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal websites and data that mimics a small software company, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
We argue that this is due, in part, to a lack of objective benchmarks that not only demonstrate the power of existing LLM-based agents to accelerate a wide variety of repetitive tasks encountered in everyday workplaces, but also provide appropriate caveats about the tasks that agents cannot do. This is a pressing issue, because the commercial and policy implications of diverse and effective acceleration or automation of work-related tasks will be broad, both positive (e.g. improved quality of life and accelerated scientific discovery) and negative (e.g. potential job displacement and increased wealth disparities).
Coverage of Multiple Work-related Tasks: In order to make valid statements about the potential of AI to accelerate or automate various types of real-world work, we need tasks that are motivated by real-world work across multiple job categories. Many benchmarks are either not relevant to real-world work (e.g. MiniWoB++ (Liu et al., 2018)) or relevant but only over a limited scope of tasks (e.g. SWE-Bench (Jimenez et al., 2024)). In contrast, TheAgentCompany contains a more diverse set of realistic, professional tasks that would typically be completed by multiple job roles in a software engineering company.
Requirement for Interaction: If agents are integrated into real-world workplaces, they need to communicate with the human members of those workplaces. Most other benchmarks do not measure communication or interactivity, with the exception of τ-bench (Yao et al., 2024), which measures interaction only in customer service scenarios. TheAgentCompany is a better testbed for communication, as agents must ask colleagues for information and report results to them as part of many more complex tasks.
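As a toy illustration of this kind of interaction, the sketch below replaces an LM-backed simulated colleague with a canned keyword lookup so the example is self-contained; the class and all names are invented for illustration and are not TheAgentCompany's actual interface.

```python
# Toy stand-in for an LM-backed simulated colleague on the company chat;
# the class and all names here are invented for illustration only.

class SimulatedColleague:
    def __init__(self, name: str, knowledge: dict[str, str]):
        self.name = name
        self.knowledge = knowledge  # canned answers instead of an LM persona

    def ask(self, question: str) -> str:
        # Return the first canned answer whose keyword appears in the question.
        for keyword, answer in self.knowledge.items():
            if keyword in question.lower():
                return answer
        return "Sorry, I don't know."

pm = SimulatedColleague("project_manager", {"deadline": "Friday at 5pm"})
print(pm.ask("What is the deadline for the sprint report?"))  # Friday at 5pm
```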
Long-horizon Tasks with Checkpoints: In real-world settings, many tasks require many steps to achieve a higher-level goal. One novel contribution of TheAgentCompany is that it both (1) contains tasks that require an agent to perform significantly more consecutive work (i.e. involving more steps and realistically taking human professionals longer to accomplish) than previous benchmarks, and (2) provides granular evaluators that measure the ability of models to perform subtasks of larger tasks; a scoring sketch follows the next paragraph.

Versatile Environment Interface: In order to handle a diversity of tasks in real-world settings, agents should at a minimum be able to interact with the tools that real-world workers use, including web interfaces, programs, command-line terminals, and communication tools. TheAgentCompany covers all of these interfaces, while most previous benchmarks focus on only one or two.
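To make the checkpoint idea concrete, below is a minimal sketch of checkpoint-based partial credit in Python. The field names, point values, and half-credit weighting are illustrative assumptions, not necessarily the exact scoring rule used by TheAgentCompany.

```python
from dataclasses import dataclass

# Minimal sketch of checkpoint-based partial credit for a long-horizon task.
# Field names, point values, and the 0.5 weighting are illustrative assumptions.

@dataclass
class Checkpoint:
    description: str
    points: int
    passed: bool  # outcome of the corresponding granular evaluator

def partial_score(checkpoints: list[Checkpoint]) -> float:
    """Full completion scores 1.0; otherwise credit is proportional to
    checkpoint points earned, discounted by half (an assumed weighting)."""
    total = sum(c.points for c in checkpoints)
    earned = sum(c.points for c in checkpoints if c.passed)
    return 1.0 if earned == total else 0.5 * earned / total

# Example: earning 1 of 3 points yields 0.5 * 1/3 of the credit.
print(partial_score([
    Checkpoint("branch pushed to internal Git host", 1, True),
    Checkpoint("final report contents correct", 2, False),
]))  # 0.1666...
```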
Self-hosted and Reproducible: In order to allow for careful comparisons between different methods that remain valid over time, the benchmark should be fully self-hosted and reproducible. This contrasts with existing benchmarks that lack execution environments (e.g. Mind2Web (Deng et al., 2023)) or require the use of a third-party hosted platform (e.g. WorkArena (Drouin et al., 2024), CRMArena (Huang et al., 2024)).
• Action Completion: Verifying whether required actions, such as using tools, navigating to URLs, or collecting data, were carried out successfully.
• Data Accuracy: Evaluating the correctness and completeness of the output, such as extracted data or formatted documents.
• Collaboration: Assessing interactions with simulated colleagues or sharing of output, such as posting messages or asking for additional information to complete the task.
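As an illustration, each category above might be realized as a small evaluator function probing the final environment state. Every path, branch name, header, and message field below is a hypothetical stand-in, not an actual benchmark checkpoint.

```python
import csv
import os

# Illustrative evaluator stubs, one per checkpoint category above.
# All paths, names, and expected values are hypothetical stand-ins.

def action_completed(pushed_branches: list[str]) -> bool:
    # Action Completion: was the required branch pushed to the internal Git host?
    return "fix-ci" in pushed_branches

def data_accurate(path: str = "/workspace/report.csv") -> bool:
    # Data Accuracy: does the produced file exist and carry the expected header?
    if not os.path.exists(path):
        return False
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    return header == ["employee", "hours", "total_pay"]

def collaborated(chat_messages: list[dict]) -> bool:
    # Collaboration: did the agent message the right simulated colleague?
    return any(m.get("to") == "hr_manager" for m in chat_messages)
```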
4.2 Workflow
Each task typically follows a workflow with three stages. Initialization: The agent sets up its workspace and prepares to execute the task. Execution: The agent completes subtasks, such as navigating tools and collecting or processing data; if required by the task, it interacts with simulated colleagues or shares results via communication platforms. Finalization: The agent produces and submits the final output for evaluation. A detailed example task can be found in Appendix C.
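The three stages can be pictured as a simple harness loop. The agent, environment, and task interfaces below are hypothetical stand-ins for illustration, not the benchmark's actual API.

```python
# Sketch of the three-stage workflow as a harness loop; the agent, env, and
# task interfaces are hypothetical stand-ins, not TheAgentCompany's actual API.

def run_task(agent, env, task) -> list[bool]:
    # Initialization: reset the self-hosted services and hand the agent its brief.
    env.reset()
    agent.receive(task.instructions)

    # Execution: step until the agent declares it is done or the budget runs out;
    # steps may include tool use, data processing, or messaging simulated colleagues.
    for _ in range(task.max_steps):
        action = agent.step(env.observe())
        env.apply(action)
        if action.is_finish:
            break

    # Finalization: grade the resulting environment state checkpoint by checkpoint.
    return [checkpoint.evaluate(env) for checkpoint in task.checkpoints]
```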
These results indicate a big gap before current AI agents can autonomously perform most of the jobs a human worker would do, even in a relatively simplified benchmarking setting. Looking at how different models perform on different types of tasks, we argue that the most challenging tasks are those that involve social interaction with other humans, that require navigating complex user interfaces designed for professionals, or that are typically performed in private, without significant open and publicly available resources.
First, we do not cover more complex creative tasks such as brainstorming new product ideas or designing system architectures. Second, we use only two agent scaffolds to establish baseline performance, and other scaffolds may perform differently. Third, while it would be interesting to know how human professionals actually perform on these tasks, to understand how LLM agents compare, resource limitations prevented us from performing this comparison in the current iteration of TheAgentCompany. Fourth, the topics and content of the tasks were mostly created through introspection by people familiar with these workplaces, which may result in some disconnect from the actual tasks performed in enterprise settings.