Agentic and Multi-Agent Systems

Does token spending drive multi-agent research performance?

Multi-agent systems substantially outperform single agents, but what actually accounts for that improvement? Is it intelligent coordination, or simply spending more tokens on the same task?

Note · 2026-02-23 · sourced from Agents Multi Architecture

Anthropic's internal evaluation of their multi-agent research system reveals a surprising decomposition: on the BrowseComp evaluation, token usage by itself explains 80% of the performance variance, and the number of tool calls and model choice account for most of the rest. Together, these three factors explain 95% of variance.
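A variance decomposition like this can be illustrated with ordinary least squares: fit the score on one predictor, then on all three, and compare the explained-variance fractions. The data below is synthetic (the BrowseComp evaluation data is not public), so the specific numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical synthetic data standing in for per-task evaluation results.
tokens = rng.normal(size=n)                              # token usage (standardized)
tool_calls = 0.5 * tokens + rng.normal(scale=0.9, size=n)  # correlated with tokens
model = rng.integers(0, 2, size=n).astype(float)         # model-choice indicator
score = (2.0 * tokens + 0.6 * tool_calls + 0.8 * model
         + rng.normal(scale=0.7, size=n))                # task performance

def r2(X, y):
    """Fraction of variance in y explained by a least-squares fit on X."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_tokens = r2(tokens[:, None], score)                          # tokens alone
r2_all = r2(np.column_stack([tokens, tool_calls, model]), score)  # all three factors
print(f"tokens alone: {r2_tokens:.2f}, all three factors: {r2_all:.2f}")
```

The gap between the two R² values is the marginal contribution of tool calls and model choice beyond raw token spend, which is the shape of the 80%-vs-95% finding.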

The implication is uncomfortable: multi-agent systems work primarily because they spend enough tokens, not because they coordinate intelligently. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects simultaneously before condensing the most important tokens for the lead agent. Each subagent provides separation of concerns — distinct tools, prompts, and exploration trajectories — which reduces path dependency.
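The lead-agent/subagent pattern described here can be sketched in a few lines. This is a hypothetical skeleton, not Anthropic's implementation: `research` and `compress` stand in for real LLM and tool calls.

```python
import asyncio

async def research(topic: str) -> str:
    # Placeholder for a long tool-use loop in the subagent's own context window.
    await asyncio.sleep(0)
    return f"raw findings on {topic} " * 3  # verbose, uncompressed output

def compress(findings: str, budget: int = 40) -> str:
    # Each subagent condenses its own exploration before reporting back,
    # so the lead agent only receives the most important tokens.
    return findings[:budget]

async def subagent(topic: str) -> str:
    return compress(await research(topic))

async def lead_agent(topics: list[str]) -> str:
    # Subagents explore different aspects in parallel, each in isolation,
    # which gives the separation of concerns described above.
    reports = await asyncio.gather(*(subagent(t) for t in topics))
    return "\n".join(reports)

summary = asyncio.run(lead_agent(["pricing", "competitors", "regulation"]))
print(summary)
```

The design point is that compression happens inside each subagent: the lead agent's context grows with the number of condensed reports, not with the total tokens explored.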

However, the economics are revealing. Multi-agent with Claude Opus as lead and Claude Sonnet subagents outperforms single-agent Opus by 90.2% on breadth-first research. Agents use roughly 4× more tokens than chat interactions, and multi-agent systems use approximately 15× more tokens than chats. Upgrading to a newer Claude Sonnet is a larger performance gain than doubling the token budget on the older model — meaning model capability multiplies token efficiency.
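The multipliers above imply a concrete cost gap. A back-of-envelope sketch, with a hypothetical chat-session token count as the baseline:

```python
# Figures from the note: agents use ~4x a chat's tokens,
# multi-agent systems ~15x. The baseline is an assumption.
CHAT_TOKENS = 10_000            # hypothetical tokens for a typical chat session
AGENT_MULT, MULTI_MULT = 4, 15  # multipliers reported in the note

single_agent = CHAT_TOKENS * AGENT_MULT
multi_agent = CHAT_TOKENS * MULTI_MULT

# The multi-agent system must deliver enough extra value to justify
# roughly 3.75x the single agent's token bill.
overhead = multi_agent / single_agent
print(f"single agent: {single_agent:,} tokens, "
      f"multi-agent: {multi_agent:,} tokens, overhead: {overhead:.2f}x")
```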

The practical design principle: multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents. But they excel specifically at tasks involving heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools. Tasks requiring shared context or with many inter-agent dependencies are not a good fit.
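The fit criteria above can be written down as a simple triage predicate. This helper is hypothetical, just an encoding of the note's rule of thumb:

```python
def prefer_multi_agent(parallelizable: bool,
                       exceeds_context: bool,
                       many_tools: bool,
                       shared_context: bool,
                       tight_dependencies: bool) -> bool:
    """Rule of thumb from the note: multi-agent suits parallelizable,
    context-heavy, tool-heavy tasks; it fits poorly when subtasks need
    shared context or have many inter-agent dependencies."""
    if shared_context or tight_dependencies:
        return False  # coordination overhead outweighs the token scaling
    return parallelizable or exceeds_context or many_tools

# Breadth-first research: parallel and context-heavy -> multi-agent.
print(prefer_multi_agent(True, True, True, False, False))   # True
# Tightly coupled task with shared state -> single agent.
print(prefer_multi_agent(True, False, False, True, True))   # False
```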

Building on the earlier note "Does search budget scale like reasoning tokens for answer quality?", the Anthropic finding extends the test-time scaling (TTS) law from search steps to token budget directly, and confirms that the scaling mechanism is fundamentally about compute quantity, not coordination quality.



Original note title: multi-agent research performance is primarily a token spending function — token usage explains 80 percent of variance while model choice and tool calls explain the remainder