Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Abstract—Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.
Our investigation began with the first-level failure taxonomy, where all annotators agreed to categorize failures according to the key task phases: task planning, task execution, and response generation. Next, each annotator independently reviewed the failure logs to summarize second-level failure reasons. Finally, they collaboratively discussed their categorizations and reached consensus on the final taxonomy. 2) Failure taxonomy: Figure 3 presents the failure taxonomy, which encompasses 19 distinct causes across three tiers.
Task planning. A planner is responsible for breaking down user instructions into a sequence of executable sub-tasks for the code generator. This role is critical since the planner’s output directly guides subsequent agents and largely determines the success of the overall framework. We identified three common issues in planning: (1) improper task decomposition, which generates steps that are logically incorrect or unsuitable for the assigned task; (2) failed self-refinement, where the model is unable to learn from its past errors, causing it to repeat the same failed sub-tasks in an infinite loop; and (3) unrealistic planning, which produces a sequence of plausible steps that exceed the practical capabilities of downstream agents, making the sub-tasks impossible to execute.
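The "failed self-refinement" loop above can be caught mechanically by tracking how often the planner re-emits a sub-task that has already failed. The following is a minimal, framework-agnostic sketch; the function name and threshold are illustrative, not part of any evaluated framework:

```python
from collections import Counter

def detect_repeated_failures(failed_subtasks, threshold=3):
    """Return sub-tasks that have failed `threshold` or more times.

    A planner that keeps re-emitting the same failed sub-task is
    stuck in the 'failed self-refinement' loop described above.
    """
    counts = Counter(failed_subtasks)
    return [task for task, n in counts.items() if n >= threshold]

# Example: the planner retried the same download step three times.
history = ["download page", "parse table", "download page", "download page"]
print(detect_repeated_failures(history))  # ['download page']
```

A framework could consult such a detector after each failed round and force a re-plan once a sub-task crosses the threshold.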
Task execution. Task execution is the phase where the agent attempts to carry out the planned sub-tasks, covering failures from both the code generator and the executor. Existing agent frameworks encounter three main failures: (1) the generator agent fails to exploit external tools (e.g., available functions), often due to a lack of online or tool-use knowledge; (2) the generator agent produces flawed code with syntax errors, functionality errors (executable but deviating from the intended output), incorrect API usage with wrong parameters, or code that conflicts with its original goal; and (3) execution fails due to improper environmental setup, such as a missing dependency package or accessing a file that does not exist.
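Environmental-setup failures of the third kind can often be caught before execution with a cheap pre-flight check. A minimal sketch, assuming the framework knows (or can scan generated code for) the packages and files a sub-task needs:

```python
import importlib.util
import os

def check_environment(required_packages, required_files):
    """Collect setup problems before executing generated code,
    so the agent can repair the environment instead of crashing."""
    problems = []
    for pkg in required_packages:
        # find_spec returns None when a top-level package is not installed
        if importlib.util.find_spec(pkg) is None:
            problems.append(f"missing package: {pkg}")
    for path in required_files:
        if not os.path.exists(path):
            problems.append(f"missing file: {path}")
    return problems
```

Surfacing this list to the planner turns an opaque runtime traceback into an actionable setup step (e.g., "install the missing package first").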
Response generation. Response generation is the final stage where the agent produces output for the user or for the planner to use in subsequent iterations. Failures at this stage relate to how results are perceived and presented, even after the code has been executed. Three main failure causes are:
(1) context window constraint: the agent loses parts of the conversation, leading to responses that are disconnected from previous interactions (e.g., an overly large HTML file in a web crawling task); (2) formatting issue: the agent’s output contains irrelevant information or does not comply with the required format (e.g., returning a sentence when a number is expected); and (3) maximum rounds exceeded: the agent reaches a preset limit on the number of interaction turns without successfully completing the task, despite attempting various plans.
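Formatting issues of the second kind are often repairable on the framework side: when a number is expected, the agent's free-text response can be post-processed rather than failing the task outright. A minimal sketch (the function name and the single-match policy are our own choices):

```python
import re

def coerce_numeric_answer(response):
    """Extract a single number from a free-text agent response,
    or return None if no unambiguous number is present."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    if len(matches) == 1:
        num = matches[0]
        return float(num) if "." in num else int(num)
    return None  # zero or multiple numbers: ambiguous, don't guess

print(coerce_numeric_answer("The page defines 42 functions."))  # 42
print(coerce_numeric_answer("between 10 and 20"))               # None
```

Returning None on ambiguity lets the framework re-prompt the agent for a compliant answer instead of silently picking the wrong number.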
- Common failure analysis: For the most common failure in each phase (i.e., planning, execution, and response generation), we present one case in Figure 4 and analyze it as follows.
Case 1: The user asks the agent to verify a linear relationship in the data. However, instead of proceeding directly to generate the necessary code for the analysis, the planner adds a redundant step: asking the user for confirmation to use linear analysis, even though such usage has already been specified in the task description. This unnecessary clarification introduces a bottleneck, halting the process until user feedback is provided. Such redundant planning not only delays the task but also degrades the user experience by creating unnecessary interaction.
Case 2: When tasked with counting the number of functions on a website, the agent generates code that operates on an incorrect assumption. The code soup.find_all('dl') presumes that all <dl> HTML tags on the page are used exclusively for listing functions. However, on complex webpages like technical documentation, these tags are often used for a variety of purposes, including navigation, definitions, or other structural elements. This flawed assumption leads to an incorrect count and demonstrates a failure to understand the contextual use of HTML structure, resulting in faulty code.
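The flawed assumption in Case 2 can be narrowed by filtering on the attributes that actually mark function entries, rather than counting every <dl>. The sketch below uses the standard-library HTML parser and assumes a Sphinx-style convention where function entries carry a "function" class; the exact class name varies by site and would need to be verified per page:

```python
from html.parser import HTMLParser

class FunctionCounter(HTMLParser):
    """Count only <dl> tags whose class marks a function entry,
    instead of assuming every <dl> lists a function."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Assumed convention: Sphinx-style docs tag functions with
        # a class containing "function", e.g. <dl class="py function">.
        if tag == "dl" and "function" in dict(attrs).get("class", ""):
            self.count += 1

html = """
<dl class="py function"><dt>spam()</dt></dl>
<dl class="glossary"><dt>term</dt></dl>
<dl class="py function"><dt>eggs()</dt></dl>
"""
parser = FunctionCounter()
parser.feed(html)
print(parser.count)  # prints 2, not 3
```

With BeautifulSoup the analogous fix is to pass a class filter to find_all rather than matching on the tag name alone.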
Case 3: The agent fails when trying to find a specific data point. It first gets a KeyError due to an additional space in a column name. The agent then switches to an alternative strategy of retrieving the entire row, which also fails, returning an empty DataFrame. Such an error implies that the agent faces challenges in self-correcting based on the output of its previous checks. This leads to a loop of failures that ultimately exceeds the maximum attempts and causes the task to fail.
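The KeyError in Case 3 stems from a stray space in a column name, which defensive header normalization would have avoided. A sketch using the standard-library csv module (with pandas, as in the case itself, df.columns = df.columns.str.strip() plays the same role):

```python
import csv
import io

def read_rows(csv_text):
    """Read CSV rows, stripping stray whitespace from header names
    so later lookups don't raise KeyError on ' Value' vs 'Value'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Accessing .fieldnames consumes the header line; then overwrite
    # it with cleaned names before reading the data rows.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    return list(reader)

rows = read_rows("Name, Value\nalpha,1\n")
print(rows[0]["Value"])  # prints 1 -- no KeyError despite the spaced header
```

Normalizing headers once at load time is cheaper than having the agent discover the mismatch through repeated failed lookups.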
V. ACTIONS ON MITIGATING AGENT FAILURES
The failures analyzed highlight critical weaknesses in agent systems, particularly in planning and error correction. To address these, we propose two key strategies as follows.
Promoting planning ability with learning-from-feedback.
The planner is the first and most fundamental component of an autonomous agent, decomposing complex tasks into executable steps. We therefore advocate a “learning-from-feedback” design, where agents learn to re-plan from feedback gathered in their previous operational environment. Recent work shows that agents can dynamically adjust plans based on tool feedback, deciding whether to refine or restart the pre-defined plan [29], [30], avoiding rigid and illogical steps. Such a feedback-aware mechanism also shows promise in software engineering applications like program repair [31] and code generation [32], [33]. This allows the agent to adopt new strategies when faced with unexpected outcomes.
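The learning-from-feedback loop can be sketched as a thin control layer around two hypothetical callables, a planner and a step executor; neither corresponds to a real framework API:

```python
def run_with_replanning(plan_fn, execute_fn, task, max_replans=3):
    """Run plan steps; on failure, feed the error back to the planner.

    plan_fn(task, feedback) -> list of steps (feedback is None at first)
    execute_fn(step) -> (ok, result)
    """
    feedback, result = None, None
    for _ in range(max_replans):
        plan = plan_fn(task, feedback)
        for step in plan:
            ok, result = execute_fn(step)
            if not ok:
                feedback = f"step {step!r} failed: {result}"
                break  # hand the error back to the planner
        else:
            return result  # every step succeeded
    raise RuntimeError(f"gave up after {max_replans} plans: {feedback}")
```

The key design choice is that the error message itself becomes the planner's next input, so re-planning is informed rather than a blind retry.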
Developing early-stop and navigation mechanisms. Failures like infinite loops and hitting round limits highlight the agent’s inability to recover from repeated mistakes. To this end, future agent systems can develop a meta-controller that routes control to the appropriate agent based on root cause analysis, either re-planning to correct a strategic error or invoking a specialized tool to fix a local execution fault. Proper navigation can efficiently fix the problem, reducing task attempts and improving reliability. Moreover, if the system detects repetitive, unresolved errors, the mechanism should trigger an “early stop”, halting the process before it hits the maximum round limit, thereby saving resources.
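The early-stop side of this mechanism reduces to bookkeeping over error signatures; a minimal sketch (the class name and repeat limit are illustrative):

```python
class EarlyStopController:
    """Halt the agent loop when the same error signature repeats,
    instead of burning rounds until the hard limit."""
    def __init__(self, repeat_limit=3):
        self.repeat_limit = repeat_limit
        self.counts = {}

    def record(self, error_signature):
        """Register a failure; return True when the loop should stop."""
        self.counts[error_signature] = self.counts.get(error_signature, 0) + 1
        return self.counts[error_signature] >= self.repeat_limit

ctrl = EarlyStopController(repeat_limit=3)
for err in ["KeyError: ' Value'"] * 3:
    if ctrl.record(err):
        print("early stop")  # fires once, on the third identical error
```

In a full system the error signature would be a normalized form of the traceback, so superficially different messages for the same root cause still count as repeats.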