Does horizontal coordination improve with stronger individual agents?
This note explores whether peer-to-peer ('horizontal') coordination among AI agents gets better when each agent is individually smarter. The corpus answer is mostly counterintuitive: stronger individuals shrink the payoff from coordinating, and the failures that remain aren't fixed by raw capability.
Reading the question as 'do agents coordinate better as peers when each one is more capable?', the corpus pushes back on the intuition. The most direct finding is that multi-agent advantages actually *diminish* as single-agent capability improves (When do multi-agent systems actually outperform single agents?). As models get stronger, the performance gap between a lone agent and a coordinating team narrows, and a single agent often wins outright. So stronger individuals don't supercharge horizontal coordination; they erode the reason to coordinate in the first place. There's even a measured ceiling: across 180 configurations, coordination stops helping once a task is already being solved above ~45% accuracy, and topology (not agent count or smarts) is what controls whether errors get amplified or damped (When does adding more agents actually help systems?).
The deeper reason is that the things that break horizontal coordination are *structural*, not capability-bound. Agents in a peer network fail by agreeing too late or by adopting a strategy without telling their neighbors. Crucially, they also accept neighbor information without verifying it, so one error propagates across the network even though each agent is individually capable of spotting a direct conflict (Why do multi-agent systems fail to coordinate at scale?). A smarter agent that still trusts its neighbors uncritically doesn't fix that. The same pattern shows up in consensus: LLM-agent groups fail mostly through 'liveness loss' (timeouts and stalled convergence) rather than getting the answer wrong, and agreement degrades as the group grows even with no malicious agents present (Can LLM agent groups reliably reach consensus together?). These are coordination-protocol problems, not intelligence problems.
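The propagation argument above can be made concrete with a toy model. The sketch below is only an illustration under strong assumptions: the ring topology, the function name `spread_error`, and the all-or-nothing `verify` flag are hypothetical and are not taken from the AgentsNet benchmark. It models the worst case, where an agent that does not verify always adopts a neighbor's erroneous value, and an agent that does verify always rejects it.

```python
from typing import Dict, List, Set


def spread_error(
    neighbors: Dict[int, List[int]],
    seed: int,
    rounds: int,
    verify: bool,
) -> Set[int]:
    """Return the agents holding the erroneous value after `rounds` rounds.

    Toy assumption: without verification an agent always adopts a bad
    value offered by a neighbor; with verification it always rejects it.
    """
    holding_error = {seed}
    for _ in range(rounds):
        newly_wrong: Set[int] = set()
        for agent in holding_error:
            for neighbor in neighbors[agent]:
                if not verify:
                    # Uncritical acceptance: the neighbor copies the error.
                    newly_wrong.add(neighbor)
        holding_error |= newly_wrong
    return holding_error


# Hypothetical six-agent ring: each agent talks to its two neighbors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

trusting = spread_error(ring, seed=0, rounds=3, verify=False)
checking = spread_error(ring, seed=0, rounds=3, verify=True)
print(len(trusting), len(checking))
```

In this model the outcome depends only on the acceptance rule and the topology. No parameter represents how capable an individual agent is, which is the sense in which the failure is structural.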
There's also a sobering deflation of what 'coordination intelligence' even contributes. One analysis finds ~80% of multi-agent performance variance comes from token budget (how much the system spends), not from clever coordination (How does test-time scaling work at the agent level?). That reframes a lot of apparent 'better coordination' as simply 'more compute,' which means making individual agents stronger may just be buying the same thing through a different door.
Where coordination *does* improve, the corpus suggests the lever is design, not individual horsepower. Structured artifact-sharing (agents producing standardized documents and pulling from a shared environment) beats free-form conversational exchange (Does structured artifact sharing outperform conversational coordination?). Hybrid protocols with fixed external ordering but autonomous internal role-selection outperform both rigid hierarchies and fully self-organizing swarms. Notably, agents in those systems self-abstain when they're incompetent, which is a coordination behavior, not a capability one (Do self-organizing agent teams outperform rigid hierarchies?). And teams can lift the floor by deactivating their weakest members at inference time via contribution scoring (Can multi-agent teams automatically remove their weakest members?), which improves the *group* by editing composition rather than upgrading every agent.
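The idea of pruning a team by contribution score can be sketched in a few lines. This is a minimal illustration, not DyLAN's actual scoring mechanism: the per-round ratings, the agent names, and the function `select_team` are all hypothetical, and the aggregation here is a plain sum followed by a top-k cut.

```python
from typing import Dict, List


def select_team(ratings: Dict[str, List[float]], keep: int) -> List[str]:
    """Keep the `keep` agents with the highest total contribution.

    `ratings` maps an agent name to hypothetical per-round contribution
    ratings; a real system would derive these from the agents' outputs.
    """
    # Aggregate: total contribution per agent across rounds.
    totals = {name: sum(per_round) for name, per_round in ratings.items()}
    # Select: rank by total, highest first, and keep the top `keep`.
    ranked = sorted(totals, key=lambda name: totals[name], reverse=True)
    return ranked[:keep]


# Made-up ratings for a four-agent team over two rounds.
example_ratings = {
    "planner": [0.4, 0.5],
    "coder": [0.3, 0.4],
    "critic": [0.1, 0.0],
    "tester": [0.2, 0.1],
}

active = select_team(example_ratings, keep=2)
print(active)
```

The design point is that the group changes while every remaining agent stays exactly as capable as before: the low scorers are simply left out of the next inference pass.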
One less obvious point: the field increasingly argues that as agents become economic actors (holding credentials, transacting, leaving auditable records), raw model capability stops being the binding constraint entirely, and reliable coordination, settlement, and accountability become the bottleneck (When do agents need coordination more than raw capability?). In that frame the question almost inverts: it's not 'does coordination improve with stronger agents?' but 'once agents are strong enough, coordination is the only thing left to improve.'
Sources (9 notes)
Empirical analysis shows MAS performance gaps narrow with stronger models, with SAS outperforming in many cases. Three formal defect types—node-level bottlenecks, edge-level overwhelm, and path-level error propagation—explain when single agents win.
Across 180 configurations, three dominant effects predict multi-agent success: tool-coordination trade-offs harm complex tasks, coordination stops helping above 45% accuracy, and topology choice controls error amplification by 4–17×. Architecture-task alignment, not agent count, determines outcomes.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.
A 25,000-task experiment across 8 models and multiple agent counts showed that sequential protocols with external ordering but internal role selection outperform centralized systems by 14% and fully autonomous systems by 44%. Agents spontaneously invented specialized roles and self-abstained when incompetent.
DyLAN's three-step importance scoring mechanism (propagation, aggregation, selection) quantifies individual agent contributions and automatically removes uninformative agents during inference, optimizing team composition without task-specific tuning.
Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.