When do multi-agent systems actually outperform single agents?
As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This note explores when coordination overhead makes MAS counterproductive.
"Single-agent or Multi-agent Systems? Why Not Both?" (2025) provides an empirical and theoretical analysis of when multi-agent systems (MAS) help versus hurt, with a finding that challenges the default toward multi-agent architectures.
The diminishing advantage. Prior studies reported MAS accuracy advantages across diverse domains. However, as frontier LLMs rapidly advance in long-context reasoning, memory retention, and tool use, many of the limitations that originally motivated MAS designs are being mitigated by single-agent capability improvements. The empirical study finds that across various agentic applications, the performance gap between MAS and single-agent systems (SAS) narrows as models get stronger, and SAS outperforms MAS outright in a substantial portion of cases.
Three MAS defect types formalized as dependency graph problems:
Node-level defect: Both MAS and SAS performance are bottlenecked by the critical agent responsible for the most difficult subtask. MAS cannot escape the ceiling set by its weakest critical component. Adding more agents does not help if the hardest subtask remains unsolved.
Edge-level defect: Downstream agents become overwhelmed by inputs from upstream agents. In multi-way conversations or prolonged iterative refinements, high in-degree nodes (summarizers, synthesizers) receive more information than they can process effectively, leading to overthinking on edge cases. This is "analogous to the overthinking of the reasoning model, but rather than being lost in thinking, the agent becomes overwhelmed by inputs from upstream agents." MAS aggravates the problem because agents process much more data.
Path-level defect: Indecisive errors propagate through chains of agent interactions. Crucial context is lost or diluted when intermediate outputs are summarized or filtered. Even small information loss causes irreversible errors downstream via snowball effects. The specific failure mode: correct solutions proposed in earlier rounds get lost during summarization before reaching the next agent — "this loss is unrecoverable, as downstream agents no longer have access to the full previous results."
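The three defect types can be pictured on a toy dependency graph. The following is an illustrative sketch only: the pipeline, per-agent success rates, and retention factor are invented for the example and do not come from the paper.

```python
from collections import defaultdict

# Toy pipeline: planner -> (coder, tester) -> summarizer
edges = [("planner", "coder"), ("planner", "tester"),
         ("coder", "summarizer"), ("tester", "summarizer")]

# Hypothetical per-agent success rates on their assigned subtasks.
success = {"planner": 0.95, "coder": 0.70, "tester": 0.90, "summarizer": 0.85}

# Node-level defect: the system ceiling is set by the weakest critical agent;
# adding more agents cannot raise it.
ceiling = min(success.values())          # 0.70, bottlenecked by "coder"

# Edge-level defect: high in-degree nodes (summarizers, synthesizers)
# receive more input than they can process effectively.
in_degree = defaultdict(int)
for src, dst in edges:
    in_degree[dst] += 1
overloaded = [n for n, d in in_degree.items() if d >= 2]   # ["summarizer"]

# Path-level defect: if each hop retains only a fraction r of crucial
# context, retention decays multiplicatively along the chain.
r, chain_length = 0.9, 3
retention = r ** chain_length            # ~0.73 of the original context
```

The multiplicative decay in the last line is what makes path-level losses irreversible: once context is dropped at one hop, no downstream agent can recover it.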
The hybrid solution. Confidence-guided routing between SAS and MAS, called request cascading, selectively offloads requests to MAS based on difficulty. The approach improves accuracy by 1.1-12% while cutting costs by up to 88%. AIME (the hardest math benchmark) is the exception where MAS consistently outperforms, illustrating the value of MAS for extremely difficult tasks.
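A minimal sketch of confidence-guided cascading under stated assumptions: `run_sas` and `run_mas` are hypothetical callables (the single agent returns an answer plus a self-reported confidence score), and the 0.8 threshold is illustrative, not the paper's value.

```python
def cascade(request, run_sas, run_mas, threshold=0.8):
    """Try the cheap single-agent path first; escalate to the
    multi-agent system only when the single agent is not confident."""
    answer, confidence = run_sas(request)
    if confidence >= threshold:
        return answer                    # easy request: SAS suffices
    return run_mas(request)              # hard request: offload to MAS

# Toy usage with stub agents.
easy_sas = lambda r: ("42", 0.95)        # confident -> no escalation
hard_sas = lambda r: ("?", 0.30)         # unsure -> escalate
mas      = lambda r: "multi-agent answer"

print(cascade("easy task", easy_sas, mas))   # -> 42
print(cascade("hard task", hard_sas, mas))   # -> multi-agent answer
```

The cost savings come from the fact that most requests take the cheap SAS branch; only the hard tail pays the MAS coordination overhead.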
This extends "When does adding more agents actually help systems?": the scaling laws quantify MAS overhead, while this paper shows that overhead becoming less worthwhile as single-agent capability increases. Combined with "Why do multi-agent LLM systems converge without real debate?", MAS suffers from both coordination overhead and pseudo-agreement, strengthening the case for SAS with selective MAS escalation.
Source: Agentic Research
Related concepts in this collection
- When does adding more agents actually help systems? Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance. (Connection: quantifies the overhead; this paper shows it becoming less worthwhile.)
- Why do multi-agent LLM systems converge without real debate? When multiple AI agents reason together, do they genuinely deliberate or just accommodate each other's views? Research into clinical reasoning systems reveals how often agents reach agreement without substantive disagreement. (Connection: MAS suffers both coordination overhead and pseudo-agreement.)
- Does token spending drive multi-agent research performance? Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task? (Connection: if tokens drive performance, a single capable model may be more efficient than many smaller ones.)
- Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there is a point beyond which additional reasoning becomes counterproductive. (Connection: the edge-level defect is external-input-induced overthinking paralleling internal overthinking.)
multi-agent system advantages diminish as single-agent LLM capabilities improve — three defect types in MAS dependency graphs explain when single beats multi