Single-agent or Multi-agent Systems? Why Not Both?
Multi-agent systems (MAS) decompose complex tasks and delegate subtasks to different large language model (LLM) agents and tools. Prior studies have reported that MAS achieve superior accuracy across diverse domains, enabled by long-horizon context tracking and error correction through role-specific agents. However, designing and deploying MAS incurs higher complexity and runtime cost than single-agent systems (SAS). Meanwhile, frontier LLMs, such as OpenAI-o3 and Gemini-2.5-Pro, have rapidly advanced in long-context reasoning, memory retention, and tool usage, mitigating many of the limitations that originally motivated MAS designs. In this paper, we conduct an extensive empirical study comparing MAS and SAS across various popular agentic applications. We find that the benefits of MAS over SAS diminish as LLM capabilities improve, and we propose efficient mechanisms to pinpoint the error-prone agent in MAS. Furthermore, the performance discrepancy between MAS and SAS motivates our design of a hybrid agentic paradigm, request cascading between MAS and SAS, to improve both efficiency and capability. Our design improves accuracy by 1.1-12% while reducing deployment costs by up to 20% across various agentic applications.
To investigate the root causes underlying these findings, we adopt a first-principles approach and abstract MAS execution as a dependency graph, where nodes represent agents and edges denote inter-agent communications (i.e., responses). We identify three primary sources of MAS defects:
• Node-Level Defect: The performance of both MAS and SAS is bottlenecked by a critical agent responsible for the most difficult subtask, which caps MAS performance.
• Edge-Level Defect: Errors arise when a downstream agent is overwhelmed by inputs from upstream agents and starts overthinking, compromising its ability to reason effectively.
• Path-Level Defect: Indecisive errors can propagate through chains of agent interactions, ultimately leading to incorrect final outputs.
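The dependency-graph abstraction above can be sketched in a few lines of Python. This is only an illustrative sketch, not our implementation: the agent names, the `build_graph` and `longest_path_length` helpers, and the three-solver debate topology are assumptions chosen for the example. It records which agent feeds which, reports the node with the highest in-degree (where edge-level defects are most likely), and measures the longest chain of hops (along which path-level defects can accumulate).

```python
from collections import defaultdict

# Hypothetical math-debate MAS: three solvers feed one summarizer.
# Nodes are agents; each edge is one inter-agent response.
edges = [
    ("solver_a", "summarizer"),
    ("solver_b", "summarizer"),
    ("solver_c", "summarizer"),
    ("summarizer", "final_answer"),
]


def build_graph(edges):
    """Return (successors, in_degree) for a directed agent graph."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1
        in_degree.setdefault(src, 0)  # make sure source-only nodes appear
    return dict(successors), dict(in_degree)


def longest_path_length(successors, in_degree):
    """Number of edges on the longest chain; the graph is assumed acyclic."""
    memo = {}

    def depth(node):
        if node not in memo:
            memo[node] = max(
                (1 + depth(nxt) for nxt in successors.get(node, [])),
                default=0,
            )
        return memo[node]

    return max(depth(node) for node in in_degree)


successors, in_degree = build_graph(edges)
# The agent receiving the most upstream responses.
fan_in_hotspot = max(in_degree, key=in_degree.get)
# The longest chain of hops a piece of information must survive.
max_hops = longest_path_length(successors, in_degree)
print(fan_in_hotspot, in_degree[fan_in_hotspot], max_hops)
```

In this toy topology the summarizer is the high in-degree node, and any information produced by a solver must survive two hops before it reaches the final answer.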
However, identifying and mitigating these defects is challenging due to the variety of agentic applications. To address this, we propose a confidence-guided tracing method that considers both the confidence and the output quality of agents to pinpoint critical agents, i.e., those that bottleneck overall MAS performance. We then prioritize augmenting these critical agents for better cost-effectiveness. In light of the performance discrepancy between SAS and MAS, we further introduce agent routing and agent cascade paradigms to selectively offload requests between SAS and MAS, thereby unlocking new accuracy-efficiency tradeoffs in agentic system deployments. Our evaluations show that our hybrid paradigm improves accuracy by 1.1-12% while reducing costs by up to 88.1%.
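To make the cascade idea concrete, the following is a minimal sketch under stated assumptions, not the mechanism evaluated in this paper. It assumes that the cheaper SAS is tried first, that each system reports a scalar confidence in [0, 1], and that a fixed threshold decides when to escalate to MAS. The names `AgentResult`, `cascade`, `stub_sas`, `stub_mas`, and the threshold value 0.8 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentResult:
    answer: str
    confidence: float  # assumed to lie in [0, 1]
    cost: float        # arbitrary cost units, e.g. tokens or dollars


def cascade(
    request: str,
    run_sas: Callable[[str], AgentResult],
    run_mas: Callable[[str], AgentResult],
    threshold: float = 0.8,  # hypothetical escalation threshold
) -> tuple[str, float]:
    """Try SAS first; escalate to MAS only when SAS confidence is low.

    Returns the chosen answer and the total cost spent on the request.
    """
    sas = run_sas(request)
    if sas.confidence >= threshold:
        return sas.answer, sas.cost
    mas = run_mas(request)
    # An escalated request pays for both attempts.
    return mas.answer, sas.cost + mas.cost


# Stub systems standing in for real SAS and MAS deployments.
def stub_sas(request: str) -> AgentResult:
    confidence = 0.4 if "hard" in request else 0.95
    return AgentResult("sas-answer", confidence, cost=1.0)


def stub_mas(request: str) -> AgentResult:
    return AgentResult("mas-answer", 0.9, cost=5.0)


easy = cascade("easy question", stub_sas, stub_mas)
hard = cascade("hard question", stub_sas, stub_mas)
print(easy, hard)
```

With these stubs, the easy request is answered by SAS alone, while the hard request is escalated and pays for both systems. The savings of such a scheme therefore depend on how many requests SAS can answer confidently.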
There are exceptions such as AIME, which is considered the most difficult math dataset. On this dataset, MAS consistently outperforms SAS, illustrating its ability to solve extremely difficult tasks.
Edge-Level Defect: The downstream agent gets overwhelmed by inputs from other agents. In MAS, agents often engage in multi-way conversations or prolonged iterative refinements.
It is common for certain nodes in the MAS execution graph to exhibit high in-degree, such as the summarizer in math debates or the response synthesizer in RAG-based documentation systems [25, 13]. However, the influx of information from upstream agents can overwhelm the receiving agent, leading to overthinking on edge cases and ultimately producing incorrect results.
This phenomenon is analogous to overthinking in reasoning models [8], but rather than being "lost" in its own thinking, the agent is overwhelmed by inputs from upstream agents. Although better prompt engineering or extended context lengths might mitigate this defect, the root cause lies in hallucination [33] and overthinking, which are intrinsic challenges of LLMs themselves. MAS aggravates the problem because agents process much more data.
Path-Level Defect: Indecisive errors propagate along the path and become fatal. As information flows between agents, crucial context can be lost or diluted [44], especially when intermediate outputs (e.g., those produced multiple hops earlier) are summarized or filtered before being passed along. Even a small piece of lost information can cause irreversible errors in downstream reasoning due to a snowball effect. In contrast, SAS has full access to its history, though it is still limited by context length. Emerging models with extended long-context capabilities [10, 29] help alleviate this issue.
We hypothesize that this defect contributes to Observation 2, where SAS consistently outperforms MAS in a portion of cases, and we conduct an extensive case study to validate this hypothesis. We provide an example from the math reasoning task below, in which a correct solution proposed in an earlier round was lost during summarization before being passed to the next agent. This loss is unrecoverable, as downstream agents no longer have access to the full previous results.
[Round X-1, Solver B] The number of sets in Bob's list is given by $\sum_{a \in A} 2^{a-1} = 2024$ ... So the answer is {{55}}
[Summarizer]: Results from three solvers are inconsistent ...
[Round X, Solver B] The error lies in the misinterpretation of the problem statement.
The number 2024 represents the total number of sets B, not the size of set A ...
So the answer is {{28}}
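The earlier answer of 55 can be checked directly from the equation quoted in the transcript. The sketch below is only a sanity check, and it assumes that the elements of A are distinct positive integers, so that the equation $\sum_{a \in A} 2^{a-1} = 2024$ is a binary expansion of 2024 with a unique solution. The helper name `recover_set` is hypothetical.

```python
def recover_set(total: int) -> list[int]:
    """Return the set A (sorted) with sum(2**(a - 1) for a in A) == total.

    Each set bit at position k of `total` contributes the element a = k + 1.
    """
    return [bit + 1 for bit in range(total.bit_length()) if (total >> bit) & 1]


A = recover_set(2024)
# Confirm that the recovered set satisfies the equation from the transcript.
assert sum(2 ** (a - 1) for a in A) == 2024
print(A, sum(A))  # prints [4, 6, 7, 8, 9, 10, 11] 55
```

The sum of the recovered elements is 55, matching the answer from Round X-1 that was discarded after summarization.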
Unlike the decisive errors studied in previous literature [43], these propagation-induced errors are difficult to detect and backtrace, because doing so requires global knowledge of which information can be safely diluted and which must be retained. However, the design principle of MAS, namely the division of tasks, runs counter to this requirement.
Summary: Think twice before adopting MAS deployment. While MAS still offers advantages such as better privacy [45], improved parallelism, and democratization [35], our extensive real-world evaluations reveal that MAS is not a one-size-fits-all solution for diverse tasks. In fact, it is often not cost-effective. Consequently, naively converting an SAS solution into an MAS one may yield a smaller accuracy improvement than expected while incurring substantially higher costs.