Which research tasks are better suited for multi-agent versus single-agent approaches?

This explores the dividing line — what kinds of work actually benefit from splitting across coordinating agents, versus what a single capable model does better alone.

The corpus is more opinionated about where the multi-agent/single-agent boundary falls than the hype suggests: the answer is task-dependent, and several notes converge on *which* features of a task tip the balance. The clearest win for multi-agent is complex synthesis that overflows one context window, such as literature review and scientific writing. There, PaperOrchestra's specialized agents beat autonomous single-model baselines by 50–68% (absolute win margin) on literature review quality in human evaluation, precisely because distributed coordination avoids single-model context failures (Can specialized agents write better scientific papers than single models?). So the heuristic isn't 'multi-agent is better,' it's 'multi-agent is better when the task is too big or too multi-faceted for one model to hold at once.'
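The "too big for one model to hold at once" test can be made concrete with a rough token budget check. The following is a minimal sketch, not a method from any of the notes: the function names, the 4,000-token output reserve, and the example window size are all hypothetical, and real token counts would come from whatever tokenizer the chosen model uses.

```python
from typing import Sequence


def fits_in_one_context(source_tokens: Sequence[int],
                        context_window: int,
                        reserved_for_output: int = 4_000) -> bool:
    """Return True if all sources plus an output reserve fit in one window.

    All three inputs are illustrative assumptions, not figures from the notes.
    """
    return sum(source_tokens) + reserved_for_output <= context_window


def suggest_approach(source_tokens: Sequence[int],
                     context_window: int,
                     reserved_for_output: int = 4_000) -> str:
    """Apply the context-overflow heuristic and return a coarse suggestion."""
    if fits_in_one_context(source_tokens, context_window, reserved_for_output):
        return "single-agent"
    return "consider multi-agent"


# Hypothetical literature review: 40 papers of about 12,000 tokens each
# against a 200,000-token window (480,000 tokens of input does not fit).
print(suggest_approach([12_000] * 40, context_window=200_000))
# Five such papers (60,000 tokens plus the reserve) do fit.
print(suggest_approach([12_000] * 5, context_window=200_000))
```

A check like this only covers the "too big" half of the heuristic; whether a task is too multi-faceted for one model is a judgment call that a token count cannot capture.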

The counterweight is just as sharp: multi-agent advantages *shrink as single models get smarter*. One analysis finds single-agent systems actually win in many cases, and names three failure modes that explain when coordination backfires: bottleneck nodes, agents overwhelmed by too many connections, and errors propagating along the path (When do multi-agent systems actually outperform single agents?). A complementary study across 180 configurations turns this into thresholds: coordination *stops helping once accuracy passes ~45%*, tool-coordination trade-offs actively hurt complex tasks, and the choice of topology changes error amplification by 4–17×. The punchline there is that architecture-task alignment, not agent count, decides outcomes (When does adding more agents actually help systems?).
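The three effects from the 180-configuration study can be read as a pre-flight checklist. The sketch below is an illustration under stated assumptions, not the study's own model: it treats the ~45% figure as a single-agent baseline accuracy, reduces "tool-coordination trade-offs" to a single `tool_heavy` flag, and stands in for topology with a `has_verification` flag. The `Plan` fields and `coordination_risks` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

# The ~45% figure from the note, read here (as an assumption) as the
# single-agent baseline accuracy above which coordination stops helping.
SATURATION_THRESHOLD = 0.45


@dataclass
class Plan:
    baseline_accuracy: float  # measured single-agent accuracy on the task, 0..1
    tool_heavy: bool          # assumption: task leans on many tool calls
    has_verification: bool    # assumption: some step checks outputs before passing them on


def coordination_risks(plan: Plan) -> List[str]:
    """List the warning signs that adding agents may not pay off."""
    risks: List[str] = []
    if plan.baseline_accuracy > SATURATION_THRESHOLD:
        risks.append("baseline already above ~45%: coordination may add little")
    if plan.tool_heavy:
        risks.append("tool-heavy task: coordination overhead may hurt")
    if not plan.has_verification:
        risks.append("no verification step: topology may amplify errors")
    return risks


# Hypothetical plan that trips all three checks.
for risk in coordination_risks(Plan(0.60, tool_heavy=True, has_verification=False)):
    print(risk)
```

An empty list is not a recommendation to go multi-agent; it only means none of these three warning signs fired.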

There's a deflating undercurrent worth knowing: a lot of what looks like 'multi-agent intelligence' is just spending. Anthropic's internal evals attribute ~80% of performance variance to token budget rather than coordination cleverness, and a model upgrade buys more than doubling the budget (Does token spending drive multi-agent research performance?; What makes multi-agent teams actually perform better?). That reframes the question: before reaching for a swarm, ask whether you just need a better model or a bigger budget on one agent.

When a task genuinely is multi-agent-shaped, two notes sharpen *how*. Open-ended ideation and research framing benefit from cognitive diversity, but only when every agent carries real senior domain expertise; diverse-but-shallow teams underperform a single competent agent because stimulation without grounding turns into process loss (Does cognitive diversity alone improve multi-agent ideation quality?). And structure matters more than autonomy: a 25,000-task experiment found that hybrid protocols, with fixed external ordering but agents choosing their own roles and self-abstaining when out of their depth, beat rigid hierarchies by 14% and fully autonomous swarms by 44% (Do self-organizing agent teams outperform rigid hierarchies?).
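The hybrid protocol described above (ordering imposed from outside, role choice and abstention left to each agent) can be sketched in a few lines. This is a toy illustration, not the experiment's implementation: the `Agent` class, the self-assessed `skill` scores, and the 0.6 confidence cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Agent:
    name: str
    skill: Dict[str, float]  # hypothetical self-assessed competence per role, 0..1

    def choose_role(self, open_roles: List[str],
                    min_confidence: float) -> Optional[str]:
        """Pick the open role this agent is best at, or abstain (None)."""
        if not open_roles:
            return None
        best = max(open_roles, key=lambda role: self.skill.get(role, 0.0))
        return best if self.skill.get(best, 0.0) >= min_confidence else None


def run_hybrid_round(agents_in_order: List[Agent], roles: List[str],
                     min_confidence: float = 0.6) -> List[Tuple[str, Optional[str]]]:
    """Visit agents in a fixed external order; each picks a role or abstains."""
    open_roles = list(roles)
    assignments: List[Tuple[str, Optional[str]]] = []
    for agent in agents_in_order:  # the ordering is fixed by the caller
        role = agent.choose_role(open_roles, min_confidence)
        if role is not None:
            open_roles.remove(role)  # a claimed role is no longer available
        assignments.append((agent.name, role))
    return assignments


team = [
    Agent("a1", {"search": 0.9, "critique": 0.4}),
    Agent("a2", {"search": 0.8, "critique": 0.7}),
    Agent("a3", {"summarize": 0.3}),
]
print(run_hybrid_round(team, ["search", "critique", "summarize"]))
# [('a1', 'search'), ('a2', 'critique'), ('a3', None)]
```

In this toy run the third agent abstains rather than taking a role it rates itself poorly at, which mirrors the self-abstention behavior the note reports.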

The thread the reader probably didn't expect: the most economical design isn't a choice between the two at all. Most agentic *subtasks* are repetitive, well-defined language work that small models handle at 10–30× lower cost, so the rational architecture is heterogeneous: small models by default, large models called in selectively (Can small language models handle most agent tasks?). And the further you scale coordination, the more the bottleneck stops being capability and becomes coordination itself: agents agree too late or adopt strategies without telling neighbors, and they accept neighbors' information without verifying it, which lets errors propagate (Why do multi-agent systems fail to coordinate at scale?). So the real boundary: single-agent (or small-model) for bounded, well-defined work; multi-agent for context-overflowing synthesis and expertise-diverse ideation. Even then, structure and verification, not agent count, are what earn the gains.
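The economics of the small-by-default pattern can be checked with back-of-the-envelope arithmetic. The sketch below assumes a simple escalation model in which every subtask is first attempted by the small model and a fraction is retried on the large one; that model, the function names, and the 20% escalation rate are hypothetical. Only the 10–30× cost ratio comes from the note, and the low end (10×) is used here.

```python
def small_first_cost(n_subtasks: int, escalation_rate: float,
                     small_cost: float, cost_ratio: float) -> float:
    """Total cost when all subtasks hit the small model and some escalate.

    Escalated subtasks pay for both the small attempt and the large retry.
    """
    large_cost = small_cost * cost_ratio
    return n_subtasks * small_cost + n_subtasks * escalation_rate * large_cost


def all_large_cost(n_subtasks: int, small_cost: float,
                   cost_ratio: float) -> float:
    """Total cost when every subtask goes straight to the large model."""
    return n_subtasks * small_cost * cost_ratio


# 100 subtasks, unit small-model cost, large model 10x pricier,
# and a hypothetical 20% of subtasks escalated.
mixed = small_first_cost(100, escalation_rate=0.2, small_cost=1.0, cost_ratio=10.0)
large = all_large_cost(100, small_cost=1.0, cost_ratio=10.0)
print(f"small-first: {mixed:.0f}, all-large: {large:.0f}")
```

Under this model the small-first design is cheaper whenever the escalation rate stays below `1 - 1/cost_ratio`, which is 90% at a 10× ratio; above that, the wasted small-model attempts outweigh the savings.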

Sources (9 notes)

Can specialized agents write better scientific papers than single models?

PaperOrchestra's specialized agents achieved 50–68% absolute win margins on literature review quality and 14–38% on overall manuscript quality versus autonomous baselines in human evaluation. Distributed coordination prevents single-model context window failures on complex synthesis tasks.

When do multi-agent systems actually outperform single agents?

Empirical analysis shows MAS performance gaps narrow with stronger models, with SAS outperforming in many cases. Three formal defect types—node-level bottlenecks, edge-level overwhelm, and path-level error propagation—explain when single agents win.

When does adding more agents actually help systems?

Across 180 configurations, three dominant effects predict multi-agent success: tool-coordination trade-offs harm complex tasks, coordination stops helping above 45% accuracy, and topology choice controls error amplification by 4–17×. Architecture-task alignment, not agent count, determines outcomes.

Does token spending drive multi-agent research performance?

Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.

What makes multi-agent teams actually perform better?

Research shows 80% of performance variance across multi-agent systems stems from token budget, not coordination intelligence. Latent communication and shared cache architectures bypass this token tax by avoiding natural language bottlenecks.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.

Do self-organizing agent teams outperform rigid hierarchies?

A 25,000-task experiment across 8 models and multiple agent counts showed that sequential protocols with external ordering but internal role selection outperform centralized systems by 14% and fully autonomous systems by 44%. Agents spontaneously invented specialized roles and self-abstained when incompetent.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
