When does adding more agents actually help systems?
Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determines when adding agents improves or degrades performance.
The question of when multi-agent systems help and when they hurt has been answered with heuristics. This paper replaces heuristics with measurement. Across 180 configurations (5 architectures × 3 LLM families × 4 benchmarks), three dominant effects emerge:
1. Tool-coordination trade-off (β=−0.330, p<0.001): tool-heavy tasks suffer disproportionately from multi-agent overhead. The mechanism is token-budget fragmentation: multi-agent systems split capacity per agent, leaving insufficient tokens for complex tool orchestration. A 16-tool software engineering task loses more accuracy under multi-agent coordination than a 2-tool financial reasoning task does.
2. Capability saturation (β=−0.408, p<0.001): once single-agent baselines exceed approximately 45% accuracy, coordination yields diminishing or negative returns. Coordination costs exceed improvement potential. This is a measurable threshold, not a vague guideline.
3. Topology-dependent error amplification: independent agents amplify errors 17.2× through unchecked propagation, while centralized coordination contains this to 4.4× via validation bottlenecks that catch errors before aggregation. The architecture is the error control mechanism.
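The third effect can be illustrated with a toy propagation model. This is a sketch under stated assumptions, not the paper's measurement methodology: each inter-agent handoff can introduce an error, and a centralized validator intercepts some fraction of errors before they propagate. The specific numbers (10 hops, 5% per-step error, 80% catch rate) are illustrative assumptions chosen to show the qualitative gap between unchecked and validated topologies.

```python
def error_amplification(n_hops: int, p_err: float, catch_rate: float) -> float:
    """Toy model: each of n_hops handoffs introduces an error with
    probability p_err; a validator intercepts each error with probability
    catch_rate (catch_rate=0 models independent agents, catch_rate>0 models
    a centralized validation bottleneck). Returns the end-to-end error rate
    relative to a single step's error rate."""
    p_propagate = p_err * (1.0 - catch_rate)
    p_clean = (1.0 - p_propagate) ** n_hops
    return (1.0 - p_clean) / p_err

# Independent agents amplify errors far more than centralized validation
independent = error_amplification(n_hops=10, p_err=0.05, catch_rate=0.0)
centralized = error_amplification(n_hops=10, p_err=0.05, catch_rate=0.8)
```

Even this crude model reproduces the qualitative finding: unchecked propagation multiplies single-step error rates several-fold, while a validation bottleneck keeps amplification close to 1.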
The practical consequences are sharp. Centralized coordination improves performance by 80.9% on parallelizable tasks (financial reasoning). Decentralized coordination excels on dynamic web navigation (+9.2% vs +0.2%). But for sequential reasoning tasks, every multi-agent variant degrades performance by 39-70%. Architecture-task alignment, not agent count, determines success.
The predictive model (R²=0.513, 87% accuracy on held-out configurations) uses measurable task properties — not post-hoc analysis. This means architecture selection can be principled rather than intuitive. The underlying mechanisms are interpretable: fragmentation, overhead exceeding marginal gains, and error propagation without validation.
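A minimal decision rule can be sketched from the two reported coefficients. This is an illustration, not the paper's fitted model: the feature definitions (a standardized tool-load score, single-agent accuracy above the 45% threshold) and the zero intercept are assumptions made for the sketch.

```python
def predicted_multi_agent_gain(tool_load: float, single_agent_acc: float) -> float:
    """Hedged sketch of a linear scoring rule using the two reported
    coefficients. Feature scaling and the zero intercept are assumptions
    for illustration, not values from the paper."""
    BETA_TOOL = -0.330  # tool-coordination trade-off
    BETA_SAT = -0.408   # capability saturation
    saturation = max(0.0, single_agent_acc - 0.45)  # 45% threshold
    return BETA_TOOL * tool_load + BETA_SAT * saturation

def choose_architecture(tool_load: float, single_agent_acc: float) -> str:
    """Prefer multi-agent only when the predicted gain is non-negative."""
    gain = predicted_multi_agent_gain(tool_load, single_agent_acc)
    return "multi" if gain >= 0 else "single"
```

The rule captures both findings at once: a tool-heavy task with a strong single-agent baseline scores sharply negative, while a low-tool task below the saturation threshold leaves multi-agent coordination on the table.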
Relative to How should we balance parallel versus sequential compute at test time?, this finding provides the multi-agent instantiation: parallel multi-agent coordination helps on parallelizable tasks and hurts on sequential ones. The 45% saturation threshold adds a quantitative decision boundary that the test-time scaling (TTS) literature lacks.

MasRouter's per-query topology routing (from Arxiv/Routers): MasRouter directly addresses the topology-dependent error amplification finding. Rather than choosing a fixed topology and accepting its scaling limitations, MasRouter routes each query to the optimal collaboration mode (Chain/Tree/Graph) via a variational latent variable model. This transforms topology from a fixed architectural choice into a per-query routing decision — the system can use centralized coordination for tasks where error propagation matters (financial reasoning) and decentralized coordination for dynamic tasks (web navigation). The 87% prediction accuracy of the scaling laws framework suggests routing decisions could be validated: does MasRouter's topology selection correlate with what the scaling laws predict would work best? See What decisions must multi-agent routing systems optimize simultaneously?.
The endogeneity paradox: autonomy degree is itself a scaling variable. The largest coordination experiment to date (25,000 tasks, 8 models, 4-256 agents, Drop the Hierarchy and Roles) reveals that the optimal coordination topology is not fixed but depends on model capability. A hybrid protocol with fixed ordering but autonomous role selection outperforms both centralized (+14%) and fully autonomous (+44%) coordination. Below a capability threshold, the relationship reverses — weak models need rigid structure. This adds a fourth scaling law: the degree of endogenous coordination is capability-contingent. The topology-dependent error amplification finding from this note interacts with autonomy level: self-organizing agents with strong models develop voluntary self-abstention (agents withdraw when they lack competence) and dynamic role invention (5,006 unique roles from 8 agents), producing emergent structures that fixed topologies cannot match. See Do self-organizing agent teams outperform rigid hierarchies?.
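The hybrid protocol described above (fixed ordering, autonomous role selection) can be sketched as a single coordination round. This is a hedged sketch: `pick_role` and `act` are assumed agent methods invented for illustration, not an API from the cited experiment.

```python
from typing import Any, List, Tuple

def hybrid_round(agents: List[Any], task: str) -> List[Tuple[str, Any]]:
    """Hedged sketch of the hybrid protocol: execution order is fixed
    (the list order), but each agent autonomously selects its own role
    or abstains entirely. pick_role/act are illustrative method names."""
    transcript: List[Tuple[str, Any]] = []
    for agent in agents:                          # fixed ordering
        role = agent.pick_role(task, transcript)  # autonomous role selection
        if role is None:                          # voluntary self-abstention
            continue
        transcript.append((role, agent.act(role, task, transcript)))
    return transcript
```

The design choice the sketch highlights: the scaffold constrains only the sequencing (the part weak models need), while role invention and abstention remain endogenous (the part strong models exploit).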
SAS and MAS capabilities converge as frontier models improve. "Single-agent or Multi-agent? Why Not Both?" (2025) finds that MAS benefits diminish as LLMs gain long-context reasoning, memory retention, and tool use, mitigating the limitations that originally motivated MAS designs. Three defect types are formalized as dependency-graph problems: node-level (a bottleneck agent caps performance), edge-level (downstream agents are overwhelmed by upstream inputs, analogous to overthinking from external information), and path-level (indecisive errors propagate as crucial context is lost during inter-agent summarization). A hybrid SAS/MAS cascading approach using confidence-guided routing improves accuracy by 1.1-12% while reducing costs by up to 88%. The exception is AIME (the hardest math benchmark), where MAS consistently outperforms, confirming MAS value at extreme difficulty.
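The cascading approach reduces to a simple routing pattern. The sketch below assumes a single-agent system that reports a confidence score; the 0.7 threshold and the callable signatures are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Tuple

Answer = Tuple[str, float]  # (answer, confidence in [0, 1])

def cascade(task: str,
            single_agent: Callable[[str], Answer],
            multi_agent: Callable[[str], str],
            conf_threshold: float = 0.7) -> str:
    """Hedged sketch of confidence-guided SAS-to-MAS routing: run the
    cheap single-agent system first and escalate only when its reported
    confidence falls below the threshold."""
    answer, confidence = single_agent(task)
    if confidence >= conf_threshold:
        return answer          # accept the cheap SAS answer
    return multi_agent(task)   # escalate hard tasks (e.g. AIME-level)
```

The cost savings come from the fact that most queries terminate at the first, cheap stage; only the low-confidence tail pays the multi-agent coordination overhead.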
Source: Agents Multi Architecture
Related concepts in this collection
-
How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
the same parallel/sequential dichotomy at the agent level rather than the token level
-
Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
single-agent token-level parallel scaling; multi-agent is the system-level analog with different economics
-
Why do multi-agent LLM systems converge without real debate?
When multiple AI agents reason together, do they genuinely deliberate or just accommodate each other's views? Research into clinical reasoning systems reveals how often agents reach agreement without substantive disagreement.
error amplification connects: independent agents propagate errors; silent agreement is one mechanism
-
Can extreme task decomposition enable reliable execution at million-step scale?
Can breaking tasks into maximally atomic subtasks with voting-based error correction solve the fundamental reliability problem in long-horizon tasks? This challenges whether better models or better decomposition is the path to high-reliability AI systems.
MAKER's extreme decomposition as one architecture choice; this paper quantifies when decomposition helps vs hurts
-
What decisions must multi-agent routing systems optimize simultaneously?
Standard LLM routing only picks which model to use. But multi-agent systems involve four interdependent choices: topology, agent count, role assignment, and per-agent model selection. Does optimizing all four together actually improve performance?
MasRouter: per-query topology routing as response to topology-dependent error amplification
Original note title
multi-agent scaling follows three quantitative laws — tool-coordination trade-off capability saturation at 45 percent and topology-dependent error amplification