Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent’s effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.
Multi-agent LLM architectures (MAS), including planners, role-playing systems, debate frameworks, and tool-specialized swarms, have demonstrated strong empirical performance across a range of tasks. At a high level, these approaches decompose reasoning across multiple agents that operate over partial contexts and communicate via generated text. In contrast, single-agent systems (SAS) perform reasoning within a single, unified context, relying on internal token-level computation rather than explicit inter-agent communication.
However, comparisons between MAS and SAS are often confounded by differences in test-time computation. MAS typically consume more tokens through longer reasoning traces or multiple agent interactions, making it unclear whether their gains arise from architectural advantages or simply from increased compute. Recent budget-aware studies suggest that, when computation is normalized, many such strategies underperform strong single-agent baselines (Wang et al., 2024; Han et al., 2025).
In this work, we revisit this question under an explicit focus on thinking token budgets, which we define as the total number of tokens used for intermediate reasoning, excluding prompts and final answers. We focus on multi-hop reasoning tasks and ask three central questions: why might SAS outperform MAS under fixed budgets, when do MAS become competitive, and how should such comparisons be conducted reliably?
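The budget definition above can be made operational with a small accounting sketch. The code below is illustrative, not tied to any specific API: the field names (`total_tokens`, `prompt_tokens`, `answer_tokens`) are hypothetical placeholders for whatever per-call token counts a serving stack reports.

```python
def thinking_tokens(call: dict) -> int:
    """Tokens spent on intermediate reasoning for one model call:
    everything except the prompt and the final answer.
    Field names here are illustrative, not from any specific API."""
    return call["total_tokens"] - call["prompt_tokens"] - call["answer_tokens"]


def mas_thinking_tokens(calls: list[dict]) -> int:
    """For a multi-agent system, the budget must sum reasoning tokens
    over every agent call, so inter-agent messages and deliberation
    are not left unaccounted."""
    return sum(thinking_tokens(c) for c in calls)
```

Under this accounting, a budget-matched comparison constrains `thinking_tokens` for the single agent to equal `mas_thinking_tokens` summed across the whole multi-agent pipeline.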
We first provide an information-theoretic perspective, based on the Data Processing Inequality, suggesting that under fixed token budgets, multi-agent decompositions introduce additional communication bottlenecks that can lead to information loss. This perspective also clarifies when MAS can be advantageous: specifically, when a single agent’s effective context utilization is degraded (e.g., due to long or noisy contexts), or when MAS benefit from additional unaccounted computation through extended interactions.
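The core of the argument can be sketched as follows (a simplified statement; the symbols here are illustrative, not the paper's formal notation). If a downstream agent observes the task only through another agent's generated message, the variables form a Markov chain, and by the Data Processing Inequality the downstream output can carry no more information about the task than the message itself:

```latex
X \;\to\; M \;\to\; Y
\quad\Longrightarrow\quad
I(X;\, Y) \;\le\; I(X;\, M),
```

where $X$ is the task input, $M$ is the text message passed between agents, and $Y$ is the receiving agent's output. A single agent with full context avoids the $M$ bottleneck entirely, which is why, under perfect context utilization and a fixed budget, decomposition can only lose information.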
We then test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS and multiple MAS architectures under matched reasoning-token budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks under these constraints.