How do evaluation methods differ for single versus multi-agent systems?

This explores whether a multi-agent system should be evaluated differently from a single agent. The corpus suggests the harder question is not single vs. multi: both break the same assumption, that you can score an agent by its final answer.

The surprising answer the corpus keeps returning to: the deeper divide is not between one agent and many. It is between *outcome* evaluation (did it get the answer?) and *trajectory* evaluation (how did it get there, and would it do so again?). On that axis, single-agent and multi-agent systems demand the same shift, for different reasons.

For single agents, the argument is that one-shot task success is a misleading number. Real performance lives in trajectory quality, memory hygiene, context efficiency, and verification cost, dimensions that a single score collapses into false confidence (see "What should we actually measure in agent evaluation?"). Push this further and the unit of evaluation stops being the model at all: reliability comes from what is externalized into the harness (memory, skills, protocols), so you are really grading the scaffolding, not the brain (see "Where does agent reliability actually come from?"). Taken to its limit, the right thing to measure is the whole coupled human-agent-environment system over many sessions, because the capacity that matters (accumulated context, reusable procedures) does not exist at the episode level where most benchmarks look (see "Should we evaluate deployed agents as whole environments instead?").
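To make the "single score collapses into false confidence" point concrete, here is a minimal sketch of a multi-dimensional evaluation profile. Everything in it is an assumption for illustration: the `EpisodeRecord` fields, the `evaluation_profile` function, and the toy numbers are invented, and the proxies are crude (redundant steps for trajectory quality, tokens per success for context efficiency, reviewer minutes for verification cost). Memory hygiene is left out because it needs cross-episode state that this sketch does not model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class EpisodeRecord:
    """One logged episode. All fields are hypothetical, for illustration."""
    succeeded: bool
    steps_taken: int
    redundant_steps: int         # steps that repeated earlier work
    context_tokens: int          # tokens consumed over the episode
    verification_minutes: float  # human time spent checking the result


def evaluation_profile(episodes: Sequence[EpisodeRecord]) -> Dict[str, float]:
    """Report several dimensions side by side instead of one score."""
    successes = sum(e.succeeded for e in episodes)
    total_tokens = sum(e.context_tokens for e in episodes)
    return {
        "success_rate": successes / len(episodes),
        "redundant_step_ratio": sum(e.redundant_steps for e in episodes)
        / sum(e.steps_taken for e in episodes),
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
        "verification_minutes_per_episode": mean(
            e.verification_minutes for e in episodes
        ),
    }


# Toy data, invented for this example: same success rate, different cost.
agent_a = [
    EpisodeRecord(True, 10, 1, 1_000, 2.0),
    EpisodeRecord(False, 10, 1, 1_000, 2.0),
]
agent_b = [
    EpisodeRecord(True, 40, 20, 8_000, 15.0),
    EpisodeRecord(False, 40, 20, 8_000, 15.0),
]

profile_a = evaluation_profile(agent_a)
profile_b = evaluation_profile(agent_b)
```

Both toy agents report the same `success_rate`, yet `profile_b` uses eight times the tokens per success and far more reviewer time. A leaderboard that kept only the first field would rank them as equal.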

Multi-agent evaluation inherits all of that and adds a coordination layer that has no counterpart in a single agent. Here the failure modes are structural and social: silent agreement, degeneration of thought, and agents accepting a neighbor's claim without verifying it and propagating the error downstream (see "Why do multi-agent systems fail despite individual capability?" and "Why do multi-agent systems fail to coordinate at scale?"). So the evaluation questions change shape: not just "was the output right?" but "which agent actually contributed?" Contribution-scoring methods like DyLAN's propagation-aggregation-selection make individual agents measurable inside the team, even deactivating freeloaders mid-run (see "Can multi-agent teams automatically remove their weakest members?"). Topology also becomes a variable to test, since architecture choice swings error amplification by 4–17×, meaning two systems with identical agents can score very differently based purely on wiring (see "When does adding more agents actually help systems?").
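The "which agent actually contributed?" question can be illustrated with a much simpler technique than DyLAN's: leave-one-out ablation. The sketch below is not DyLAN's propagation-aggregation-selection procedure. It only re-scores the team with each agent removed and reports the drop. The agent names, the `SOLVES` table, and the `toy_team_score` function are hypothetical stand-ins for a real evaluation run.

```python
from typing import Callable, Dict, Sequence, Set


def leave_one_out_contribution(
    agents: Sequence[str],
    score_team: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Contribution of each agent = full-team score minus score without it."""
    full_score = score_team(agents)
    return {
        agent: full_score - score_team([a for a in agents if a != agent])
        for agent in agents
    }


# Hypothetical toy setup: which task IDs each agent can solve on its own.
SOLVES: Dict[str, Set[int]] = {
    "planner": {1, 2},
    "coder": {2, 3, 4},
    "echo": set(),  # repeats what others say and solves nothing new
}


def toy_team_score(team: Sequence[str]) -> float:
    """Stand-in scorer: number of distinct tasks the team covers."""
    covered: Set[int] = set()
    for agent in team:
        covered |= SOLVES[agent]
    return float(len(covered))


contributions = leave_one_out_contribution(list(SOLVES), toy_team_score)
freeloaders = [name for name, delta in contributions.items() if delta == 0.0]
```

In this toy case `echo` has zero marginal contribution and would be the candidate for removal. Ablation needs one extra team run per agent, which is one reason methods that score contributions during a single run are attractive.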

The trap the corpus warns about hardest: most reported multi-agent wins are confounded by spending. Roughly 80% of multi-agent performance variance turns out to be a function of token budget, not coordination intelligence, so any honest comparison has to hold tokens fixed, or you are just measuring who paid more (see "How does test-time scaling work at the agent level?"). That reframes the whole single-vs-multi question. Once you control for spend, multi-agent advantages shrink as base models improve, and single agents win outright in many cases (see "When do multi-agent systems actually outperform single agents?"). Evaluation also has to check a precondition that coordination cannot fix: diverse agents without genuine domain expertise underperform a single competent agent, because stimulation without grounding produces process loss, not insight (see "Does cognitive diversity alone improve multi-agent ideation quality?").
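Holding tokens fixed can be sketched as a comparison that only ever subtracts accuracies measured at the same token budget. The code below is one possible shape for that check, under stated assumptions: the `Run` record, the `"single"`/`"multi"` labels, and the run data are invented for illustration and do not come from any of the cited notes.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Run(NamedTuple):
    system: str        # "single" or "multi" (hypothetical labels)
    token_budget: int  # token cap enforced for this run
    correct: bool


def accuracy_by_budget(runs: Iterable[Run]) -> Dict[Tuple[str, int], float]:
    """Accuracy per (system, token_budget) cell."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [num_correct, num_runs]
    for run in runs:
        cell = tally[(run.system, run.token_budget)]
        cell[0] += int(run.correct)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in tally.items()}


def matched_gaps(runs: Iterable[Run]) -> Dict[int, float]:
    """Multi minus single accuracy, only where both ran at the same budget."""
    acc = accuracy_by_budget(runs)
    budgets = sorted({budget for (_, budget) in acc})
    return {
        b: acc[("multi", b)] - acc[("single", b)]
        for b in budgets
        if ("multi", b) in acc and ("single", b) in acc
    }


# Toy runs, invented for this example.
runs = (
    [Run("single", 10_000, c) for c in (True, True, False, False)]
    + [Run("multi", 10_000, c) for c in (True, True, False, False)]
    + [Run("single", 40_000, c) for c in (True, True, True, False)]
    + [Run("multi", 40_000, c) for c in (True, True, True, False)]
)

acc = accuracy_by_budget(runs)
# Unmatched comparison: multi at a large budget vs. single at a small one.
naive_gap = acc[("multi", 40_000)] - acc[("single", 10_000)]
gaps = matched_gaps(runs)
```

With this toy data the unmatched comparison shows a 0.25 advantage for the multi-agent system, while the budget-matched gaps are zero at both budgets. The apparent win came entirely from spending four times as many tokens.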

The upshot: evaluating a multi-agent system fairly usually means first proving it deserves to be multi-agent at all. The same methods that grade a single agent's trajectory become the control group. Many multi-agent benchmarks, once you subtract the extra tokens and the credit a single strong model would have earned alone, are measuring an architecture solving a problem it created for itself.

Sources (10 notes)

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Should we evaluate deployed agents as whole environments instead?

A single-investigator case study with 75,671 telemetry records shows that capacity gains come from accumulated context and reusable procedures that only exist across sessions with human direction. Model and episode-level evaluation cannot measure these cross-session variables.

Why do multi-agent systems fail despite individual capability?

Multi-agent systems exhibit specific failure modes—silent agreement, degeneration of thought, and social accommodation—that mirror individual reasoning failures at group scale. Real-world autonomous task completion plateaus near 30% regardless of agent count; capability gains require deliberation diversity, expertise prerequisites, and formal coordination architectures.

Why do multi-agent systems fail to coordinate at scale?

The AgentsNet benchmark shows that agents fail to coordinate strategies, either by agreeing too late or by adopting strategies without informing neighbors. Agents accept neighbor information without verification, which enables error propagation, even though they remain capable of detecting direct conflicts.

Can multi-agent teams automatically remove their weakest members?

DyLAN's three-step importance scoring mechanism (propagation, aggregation, selection) quantifies individual agent contributions and automatically removes uninformative agents during inference, optimizing team composition without task-specific tuning.

When does adding more agents actually help systems?

Across 180 configurations, three dominant effects predict multi-agent success: tool-coordination trade-offs harm complex tasks, coordination stops helping above 45% accuracy, and topology choice controls error amplification by 4–17×. Architecture-task alignment, not agent count, determines outcomes.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

When do multi-agent systems actually outperform single agents?

Empirical analysis shows that the performance gap between multi-agent systems (MAS) and single-agent systems (SAS) narrows with stronger models, with SAS outperforming in many cases. Three formal defect types (node-level bottlenecks, edge-level overwhelm, and path-level error propagation) explain when single agents win.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
