Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures
Abstract—We present a 25,000-task computational experiment comparing coordination architectures in multi-agent LLM systems across 8 models, 4–256 agents, and 8 protocols. Our key finding is the endogeneity paradox: a hybrid protocol (Sequential) where agent ordering is fixed but role selection is autonomous outperforms both centralized coordination (+14%, p < 0.001) and fully autonomous protocols (+44%, Cohen’s d = 1.86, p < 0.0001). Effective self-organization requires both a capable model and the right protocol—neither alone suffices; models below a capability threshold exhibit a reversal where rigid structure outperforms autonomy. The system scales sub-linearly to 256 agents (p = 0.61) and exhibits emergent properties: dynamic role invention (5,006 unique roles from 8 agents), voluntary self-abstention, and spontaneous hierarchy formation. Results are reproduced across closed-source and open-source models, with open-source achieving 95% quality at 24× lower cost.
AI agents need three things to self-organize—and none of them is a pre-assigned role. Given a mission, a communication protocol, and a sufficiently capable model, groups of LLM-based agents spontaneously form organizational structures, invent specialized roles, and voluntarily abstain from tasks outside their competence—outperforming systems with externally designed hierarchies by 14% (p < 0.001). But remove any of the three ingredients, and the system collapses: without a strong model, self-organization reverses and rigid structure becomes necessary; without the right protocol, even the strongest model underperforms.
These are the findings of a 25,000-task computational experiment—the largest to date—comparing coordination architectures in multi-agent systems based on large language models (LLMs). A fundamental question has been overlooked: what coordination architecture enables the best trade-off between solution quality, cost, scalability, and resilience to disruptions?
Current research splits into two directions. Vertical self-improvement focuses on making individual agents smarter—exemplified by Meta's DGM-Hyperagents [10], which achieves open-ended self-improvement through metacognitive self-modification. Horizontal coordination addresses how groups of agents collaborate, dominated by systems that replicate human organizational patterns: fixed roles, centralized task allocation, rigid hierarchies [1]–[4]. Vertical self-improvement does not answer how multiple agents should coordinate; horizontal frameworks provide structure but may impose unnecessary constraints on agents whose computational nature is fundamentally different from that of human workers—an LLM agent can instantaneously change specialization, process the full organizational context, and incur zero marginal cost when idle.
This paper addresses horizontal coordination with a key insight: effective self-organization requires two conditions simultaneously— a capable foundation model and the right coordination protocol. The protocol unlocks the model’s potential, like sheet music unlocks an orchestra; but an orchestra of beginners (weak models) plays better with a conductor than without one.
In this work, we conduct the largest systematic computational experiment on coordination in multi-agent LLM systems to date, spanning:
• 25,000+ task runs across 20,810 unique configurations;
• 8 LLM models (closed-source: Claude Sonnet 4.6, GPT-5.4, GPT-4o, GPT-4.1-mini, Gemini-3-flash, GigaChat 2 Max; open-source: DeepSeek v3.2, GLM-5);
• 4 to 256 agents per system;
• 8 coordination protocols, from centralized (Coordinator) to fully autonomous (Shared);
• 4 task complexity levels (L1–L4), from single-domain to adversarial multi-stakeholder tasks.
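To make the protocol spectrum concrete, the sketch below mocks the three protocol families named above—centralized Coordinator, hybrid Sequential (fixed ordering, self-selected roles), and fully autonomous Shared. All function and field names are hypothetical illustrations; in the actual system each role decision is an LLM call, not the stand-in heuristic used here.

```python
import random

# Illustrative sketch of three coordination-protocol families.
# Agent behavior is mocked; every name here is hypothetical.

ROLES = ["planner", "researcher", "critic", "writer"]

def mock_llm_pick_role(agent_id, mission, taken):
    # Stand-in for an LLM choosing a specialization given the
    # mission and the roles already claimed by earlier agents.
    free = [r for r in ROLES if r not in taken] or ROLES
    return free[agent_id % len(free)]

def coordinator_protocol(n_agents, mission):
    # Exogenous: a central coordinator assigns fixed roles up front.
    return {i: ROLES[i % len(ROLES)] for i in range(n_agents)}

def sequential_protocol(n_agents, mission):
    # Hybrid: ordering is fixed (agent 0 acts first), but each agent
    # selects its own role with visibility into earlier choices.
    taken, assignment = [], {}
    for i in range(n_agents):
        role = mock_llm_pick_role(i, mission, taken)
        assignment[i] = role
        taken.append(role)
    return assignment

def shared_protocol(n_agents, mission, seed=0):
    # Endogenous: no ordering and no shared claim list; agents choose
    # concurrently, so role duplication and gaps can occur.
    rng = random.Random(seed)
    return {i: rng.choice(ROLES) for i in range(n_agents)}
```

The structural difference the paper measures is visible even in this toy: Sequential agents see prior choices and so cover all roles, while Shared agents may collide.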
We distinguish between exogenous coordination (structure imposed externally) and endogenous coordination (structure emerging from within the system). Our central finding is the endogeneity paradox: neither maximal external control nor maximal agent autonomy produces optimal results. Instead, a hybrid protocol that provides minimal structural scaffolding (fixed ordering) while allowing maximal role autonomy (self-selected specialization) achieves significantly superior outcomes. The main contributions of this paper are:
• A framework for characterizing coordination protocols from exogenous (externally controlled) to endogenous (self-organized), with empirical validation across 8 protocols.
• The discovery of the endogeneity paradox: the hybrid Sequential protocol outperforms the fully decentralized Shared protocol by 44% in a controlled pilot (Cohen's d = 1.86, p < 0.0001), and outperforms the centralized Coordinator by 14% at scale (p < 0.001).
• Evidence that among strong models, coordination protocol choice (44% quality variation) and model selection (∼14%) are both critical, with neither alone sufficient for self-organization.
• Demonstration of sub-linear scaling from 4 to 256 agents without quality degradation (p = 0.61), with emergent phenomena including dynamic role invention (RSI → 0), voluntary self-abstention, and shallow self-organized hierarchies.
• Cross-validation of self-organization across closed-source and open-source LLMs, establishing a capability threshold below which self-organization reverses and fixed structure becomes beneficial.
• A three-ring constitutional framework for governing autonomous multi-agent organizations.
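For readers unfamiliar with the effect-size metric used in these results, Cohen's d is the difference in sample means divided by the pooled standard deviation; d = 1.86 is a very large effect. The minimal implementation below uses only the standard library; the scores in the usage note are synthetic, not the paper's data.

```python
import statistics

def cohens_d(sample_a, sample_b):
    """Effect size between two independent samples,
    using the pooled standard deviation (Cohen's d)."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a)  # unbiased (n-1) variance
    vb = statistics.variance(sample_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd
```

For example, `cohens_d([8, 9, 10, 9, 8], [5, 6, 5, 6, 5])` yields a d of about 4.8: the two synthetic quality-score samples barely overlap, which is the kind of separation a large d implies.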
A. Multi-Agent LLM Systems
Multi-agent LLM systems have gained significant attention, with several comprehensive surveys mapping the landscape [19]–[21]. Prominent frameworks include ChatDev [1], which assigns fixed software engineering roles to agents in a waterfall pipeline; MetaGPT [2], which encodes Standard Operating Procedures as inter-agent protocols; and AutoGen [3], which provides a conversation-based framework for multi-agent collaboration. AgentVerse [4] introduces dynamic team formation but retains a centralized "recruiter" agent. GPTSwarm [16] models agents as optimizable computation graphs, and Mixture-of-Agents [17] demonstrates that layering LLM outputs improves quality. Recent work on scaling multi-agent collaboration [18] has explored team size effects but with fixed architectures. These systems use exogenous coordination: roles, hierarchies, and interaction patterns are designed by humans and fixed before execution.
B. Emergent Coordination and Self-Organization
Self-organization in multi-agent systems has deep roots in both classical MAS theory [11]–[13] and biological complexity science [14], [15]. In the LLM era, recent work has explored more autonomous coordination. EvoAgentX [5] uses evolutionary optimization (TextGrad) to adapt agent populations but requires gradient-based training. AgentNet [6] retrieves optimal Directed Acyclic Graphs (DAGs) for agent routing but remains centralized. MAS-ZERO [7] employs a meta-designer for zero-shot multi-agent generation but lacks runtime adaptation. ReSo [8] trains a Contribution Reward Model for DAG optimization, requiring labeled data. HiVA [9] proposes semantic-topological evolution but has been tested only at small scale.
C. Self-Improving Agent Systems
A complementary research direction focuses on individual agents that recursively improve themselves. The Darwin Gödel Machine (DGM) and its extension DGM-Hyperagents [10] achieve impressive open-ended self-improvement through metacognitive self-modification, where the improvement procedure itself is editable. This work advances vertical intelligence—making each agent individually stronger. Our work advances horizontal intelligence—making groups of agents collectively effective. The two directions are orthogonal and synergistic: stronger individual agents (as produced by Hyperagents-style self-improvement) benefit more from self-organizing coordination protocols (as studied here). Together, they represent two complementary paths toward more capable AI systems.
D. Gap and Positioning
Existing approaches address different facets of the multi-agent challenge: fixed-architecture coordination [1], [2], training-based adaptation [5], [8], and individual agent self-improvement [10]. Our study contributes to this landscape by focusing on a question that has received less attention: how does the degree of agent autonomy in coordination—from centralized to fully self-organized—affect collective performance at scale? To our knowledge, this is the first work to systematically vary coordination protocols across an exogenous-to-endogenous spectrum, test groups up to 256 agents, compare 8 LLM models, and demonstrate zero-shot runtime self-organization (Table I).