Does model capability still matter once coordination infrastructure is optimized?

This explores whether raw model intelligence still drives outcomes once the scaffolding around models — routing, protocols, coordination standards — is mature, or whether the bottleneck simply shifts elsewhere.

The corpus suggests the honest answer is that it depends on which stage of the agent economy you're in, and that the bottleneck moves rather than disappears. The clearest statement of the shift is that once agents hold credentials, transact value, and interact with each other, raw model capability stops being the limiting factor; the binding constraint becomes whether they can coordinate reliably, settle accounts, and leave auditable evidence (see "When do agents need coordination more than raw capability?"). That's an economic argument: capability is necessary but no longer scarce, so the marginal returns move to infrastructure.

But the corpus also pushes back from the opposite direction. Multi-agent coordination, the thing infrastructure is supposed to deliver, turns out to be surprisingly capability-bound. The advantage of multi-agent systems shrinks as single-agent models get stronger, because better models stop producing the node bottlenecks, edge overwhelm, and error propagation that coordination was compensating for (see "When do multi-agent systems actually outperform single agents?"). And coordination doesn't degrade randomly: it fails predictably at scale through timing failures and agents uncritically accepting neighbors' information (see "Why do multi-agent systems fail to coordinate at scale?"). So 'optimized coordination infrastructure' is partly a euphemism for 'models good enough not to need babysitting.'

There's a sharper, more deflating finding underneath all this: a large analysis of test-time scaling found that roughly 80% of multi-agent performance variance comes from token budget, not coordination intelligence (see "How does test-time scaling work at the agent level?"). If most of the apparent value of coordination is just spending more compute, then 'optimizing infrastructure' and 'throwing capability at the problem' may be closer to the same lever than they look. The counterweight is that selection can beat scale: routing queries to specialized models per semantic cluster outperforms a single frontier model, or matches it at roughly 27% lower cost (see "Can routing beat building one better model?"). That's a genuine case where infrastructure (the router) substitutes for capability (a bigger model).
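The routing idea can be made concrete with a minimal sketch. It assumes a toy 2-D embedding space, hand-picked cluster centroids, and made-up per-cluster accuracy and cost figures; none of these names or numbers come from the cited work, and a real router would learn both the clusters and the per-cluster scores from data.

```python
import math

# Hypothetical cluster centroids in a toy 2-D embedding space (assumed values).
CENTROIDS = {
    "math": (1.0, 0.0),
    "code": (0.0, 1.0),
}

# Hypothetical models with illustrative (not measured) per-cluster accuracy
# and a relative cost per query.
MODELS = {
    "small-math": {"cost": 1.0, "accuracy": {"math": 0.90, "code": 0.40}},
    "small-code": {"cost": 1.0, "accuracy": {"math": 0.40, "code": 0.90}},
    "frontier": {"cost": 10.0, "accuracy": {"math": 0.85, "code": 0.85}},
}


def nearest_cluster(embedding):
    """Return the name of the centroid closest to the query embedding."""
    return min(CENTROIDS, key=lambda name: math.dist(embedding, CENTROIDS[name]))


def route(embedding, cost_weight=0.0):
    """Pick the model with the best accuracy-minus-weighted-cost score
    for the query's semantic cluster."""
    cluster = nearest_cluster(embedding)

    def score(model_name):
        model = MODELS[model_name]
        return model["accuracy"][cluster] - cost_weight * model["cost"]

    return cluster, max(MODELS, key=score)


if __name__ == "__main__":
    print(route((0.9, 0.1)))                    # a math-like query
    print(route((0.2, 0.8), cost_weight=0.01))  # a code-like query, cost-aware
```

The `cost_weight` parameter is one assumed way to express the accuracy-versus-cost trade-off: at zero the router maximizes accuracy alone, and raising it pushes traffic toward cheaper specialists.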

The thread that reframes the whole question is that 'capability' is not one number. Agent capability is a vector across separable axes (task success, privacy, long-horizon retention, mode-shift behavior, ecosystem readiness), and models that top one axis often rank low on others (see "Does a single benchmark score actually predict agent readiness?"). Much of what we call coordination infrastructure is really about exercising the axes that raw benchmark capability ignores: versioned capability vectors for discovery (see "Can semantic capability vectors replace manual agent routing?"), protocols that wrap rather than replace existing standards (see "Should coordination protocols wrap existing systems or replace them?"), and structured artifacts instead of chat (see "Does structured artifact sharing outperform conversational coordination?"). So the question dissolves a little: optimized infrastructure doesn't make capability stop mattering; it changes which dimension of capability is on the critical path.
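To illustrate why a single score hides this, here is a small sketch that treats capability as a vector over the five axes named above. The two models and all of their scores are invented for illustration; the only point is that per-axis rankings can disagree and that neither model need dominate the other.

```python
# The five axes named in the text; the order fixes the meaning of each slot.
AXES = (
    "task_success",
    "privacy",
    "long_horizon_retention",
    "mode_shift",
    "ecosystem_readiness",
)

# Hypothetical models with illustrative (not measured) scores per axis.
CAPABILITY = {
    "model_a": (0.92, 0.55, 0.60, 0.70, 0.50),
    "model_b": (0.80, 0.90, 0.75, 0.65, 0.85),
}


def rank_by_axis(models, axis):
    """Return model names sorted best-first on a single axis."""
    i = AXES.index(axis)
    return sorted(models, key=lambda name: models[name][i], reverse=True)


def pareto_dominates(a, b):
    """True if vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


if __name__ == "__main__":
    for axis in AXES:
        print(axis, rank_by_axis(CAPABILITY, axis))
    a, b = CAPABILITY["model_a"], CAPABILITY["model_b"]
    print("a dominates b:", pareto_dominates(a, b))
    print("b dominates a:", pareto_dominates(b, a))
```

With these assumed numbers, `model_a` leads on task success while `model_b` leads on privacy, and neither Pareto-dominates the other, so any single-number ranking has to pick weights that the deployment context should be choosing.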

The thing you might not have known you wanted to know: even a 'capable' model can be hollow underneath. Two models with identical benchmark scores can have fundamentally different internal organization, one of them fractured and brittle to distribution shift in ways no metric reveals (see "Can models be smart without organized internal structure?"). And no amount of coordination scaffolding closes the gap that pure self-improvement leaves: reliable improvement always smuggles in an external signal, such as a judge, a tool, or a user correction (see "Can models reliably improve themselves without external feedback?"). Infrastructure and capability aren't competitors so much as two names for where you put the external anchor.

Sources (11 notes)

When do agents need coordination more than raw capability?

Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.

When do multi-agent systems actually outperform single agents?

Empirical analysis shows that the performance gap of multi-agent systems (MAS) narrows with stronger models, with single-agent systems (SAS) outperforming them in many cases. Three formal defect types (node-level bottlenecks, edge-level overwhelm, and path-level error propagation) explain when single agents win.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

Can semantic capability vectors replace manual agent routing?

Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.
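A rough sketch of what such discovery could look like follows. It is not the described system: the `AgentCard` record, its fields, the region-based policy check, and the example agents are all hypothetical, and a brute-force cosine scan stands in for the HNSW index, so it does not reproduce the sub-linear scaling. It only shows the coupling of semantic matching with policy and budget filters.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCard:
    """Hypothetical registry entry: a versioned capability vector plus constraints."""
    name: str
    version: str
    vector: tuple
    cost_per_call: float
    allowed_regions: frozenset


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def discover(query, agents, region, budget, k=1):
    """Filter by policy (region) and budget first, then rank by semantic similarity.
    A linear scan is used here in place of an approximate index such as HNSW."""
    eligible = [
        a for a in agents
        if region in a.allowed_regions and a.cost_per_call <= budget
    ]
    eligible.sort(key=lambda a: cosine(query, a.vector), reverse=True)
    return eligible[:k]


# Illustrative agents; names, versions, vectors, and costs are made up.
AGENTS = [
    AgentCard("translator", "1.2.0", (0.9, 0.1, 0.0), 0.02, frozenset({"eu", "us"})),
    AgentCard("translator-pro", "2.0.0", (1.0, 0.0, 0.0), 0.50, frozenset({"us"})),
    AgentCard("summarizer", "0.9.1", (0.1, 0.9, 0.1), 0.01, frozenset({"eu"})),
]

if __name__ == "__main__":
    query = (1.0, 0.0, 0.0)
    print([a.name for a in discover(query, AGENTS, region="eu", budget=0.10)])
    print([a.name for a in discover(query, AGENTS, region="us", budget=1.00)])
```

Filtering before ranking is a design choice in this sketch: the closest semantic match (`translator-pro`) is never returned for an EU request or a tight budget, which is the sense in which policy and budget become part of discovery rather than an afterthought.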

Should coordination protocols wrap existing systems or replace them?

Research shows that agent coordination standards achieve adoption by composing existing protocols like MCP and DIDComm under a shared substrate, rather than competing to replace them. Bridging lets value accrue incrementally without forcing ecosystem-wide rewrites.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
