Why Do Multi-agent LLM Systems Fail?

Despite growing enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks often remain minimal compared to single-agent frameworks. This gap highlights the need to systematically analyze the challenges hindering MAS effectiveness. We present MAST (Multi-Agent System Failure Taxonomy), the first empirically grounded taxonomy designed to understand MAS failures. We analyze five popular MAS frameworks across over 150 tasks, involving six expert human annotators. Through this process, we identify 14 unique failure modes, organized into three overarching categories: (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification.

A multi-agent system (MAS) is defined as a collection of agents designed to interact through orchestration, enabling collective intelligence. MASs are structured to coordinate efforts, enabling task decomposition, performance parallelization, context isolation, specialized model ensembling, and diverse reasoning discussions He et al. (2024b); Mandi et al. (2023); Zhang et al. (2024); Du et al. (2023); Park et al. (2023a); Guo et al. (2024a).
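A minimal sketch of this orchestration pattern, with LLM agents stubbed as plain functions (the `researcher`/`writer` roles and the decomposition scheme are assumptions for illustration, not a definition from the paper):

```python
from typing import Callable

# An "agent" here is just a function from a subtask string to a result.
Agent = Callable[[str], str]

def researcher(subtask: str) -> str:
    # Stub for a specialist agent; a real system would call an LLM.
    return f"notes({subtask})"

def writer(subtask: str) -> str:
    return f"draft({subtask})"

def orchestrate(task: str, agents: dict[str, Agent]) -> str:
    """Naive decomposition: one subtask per role, then merge results."""
    results = [agent(f"{role}:{task}") for role, agent in agents.items()]
    return " | ".join(results)
```

Real frameworks add shared context, message passing, and termination logic on top of this loop, which is precisely where many of the failure modes cataloged by MAST arise.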

The promising capabilities of agentic systems have inspired research into specific agentic challenges. For instance, Agent Workflow Memory Wang et al. (2024e) addresses long-horizon web navigation by introducing workflow memory to enhance agent adaptability and efficiency. DSPy Khattab et al. (2023) and Agora Wang et al. (2024e) tackle issues in communication flow, and StateFlow Wu et al. (2024b) focuses on state control within agentic workflows to improve task-solving capabilities.

While these works make meaningful contributions to particular use cases, they neither provide a comprehensive understanding of why MASs fail nor propose a strategy that can be broadly applied across domains.

Numerous benchmarks have been proposed to evaluate agentic systems Jimenez et al. (2024); Peng et al. (2024); Wang et al. (2024c); Anne et al. (2024); Bettini et al. (2024); Long et al. (2024). These evaluations are crucial for identifying challenges and limitations in agentic systems, yet they primarily offer a top-down perspective, focusing on higher-level objectives such as task performance, trustworthiness, security, and privacy Liu et al. (2023b); Yao et al. (2024b).

2.2 DESIGN PRINCIPLES FOR AGENTIC SYSTEMS

Several works highlight the challenges of building robust agentic systems and suggest new strategies, typically for single-agent designs, to improve reliability. For instance, Anthropic's blog Anthropic (2024a) stresses the importance of simplicity and modular components, such as prompt chaining and routing, over adopting overly complex frameworks. Similarly, Kapoor et al. (2024) show that complexity can hinder real-world adoption of agentic systems. Our work extends these insights by systematically investigating failure modes in MASs, offering a taxonomy that explains why MASs fail, and suggesting solutions that align with these principles of agentic system design.
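The two patterns named above, prompt chaining and routing, can be sketched as follows. This is a hedged illustration of the general patterns, not Anthropic's implementation; `fake_llm` and the keyword-based router are placeholders.

```python
from typing import Callable

def fake_llm(prompt: str) -> str:
    # Stand-in for a model call; just wraps the prompt so the data
    # flow through the chain is visible.
    return f"<{prompt}>"

def chain(prompts: list[str], user_input: str) -> str:
    """Prompt chaining: each step's output feeds the next prompt."""
    text = user_input
    for p in prompts:
        text = fake_llm(f"{p}: {text}")
    return text

def route(query: str, handlers: dict[str, Callable[[str], str]]) -> str:
    """Routing: dispatch to a specialist handler; here a naive
    keyword match stands in for an LLM-based classifier."""
    for keyword, handler in handlers.items():
        if keyword != "default" and keyword in query:
            return handler(query)
    return handlers["default"](query)
```

Both patterns keep each component small and inspectable, which is exactly the simplicity argument the blog post makes against monolithic agent frameworks.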