Can two agents with identical token counts produce vastly different outputs?
This explores whether token count — the headline number everyone uses to predict agent performance — actually determines what an agent produces, or whether two agents spending the same tokens can diverge wildly in output.
The short answer the corpus gives: yes. Identical token counts can produce very different outputs, and the *reasons why* are more interesting than the question first suggests.
Start with the surprising baseline. Anthropic's evals found that token spending alone explains about 80% of the performance variance in multi-agent research systems (Does token spending drive multi-agent research performance?; Are multi-agent systems actually intelligent coordination or just token spending?). That makes it tempting to treat tokens as destiny. But 80% is not 100%, and the remaining gap is exactly where two equal-budget agents diverge. *How* you spend tokens turns out to matter alongside how many you spend: shared-prefix tree rollouts produce more distinct trajectories within a fixed token budget than independent chains, because branching from a common prefix avoids re-paying for redundant work (Can shared-prefix trees reduce redundancy in agent rollouts?). Same budget, more genuine exploration, and therefore different, often better outputs.
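The budget arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: it assumes a single branching point, a fixed prefix length, and a fixed per-branch length, and the function names and token figures are hypothetical.

```python
def independent_chain_count(budget: int, prefix_len: int, branch_len: int) -> int:
    """Trajectories affordable when every chain regenerates the prefix itself."""
    return budget // (prefix_len + branch_len)


def shared_prefix_tree_count(budget: int, prefix_len: int, branch_len: int) -> int:
    """Trajectories affordable when the prefix is generated once, then branched."""
    if budget < prefix_len:
        return 0  # cannot even pay for the shared prefix
    return (budget - prefix_len) // branch_len


# Hypothetical numbers: a 12,000-token budget, a 2,000-token shared prefix,
# and 1,000 tokens per branch continuation.
budget, prefix_len, branch_len = 12_000, 2_000, 1_000
print(independent_chain_count(budget, prefix_len, branch_len))   # 4
print(shared_prefix_tree_count(budget, prefix_len, branch_len))  # 10
```

Under these assumed numbers the same budget buys 4 independent chains or 10 branches from one prefix. Real tree rollouts branch at several depths, so the ratio will differ, but the direction of the effect is the same.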
The most direct evidence sits one level deeper, at the prompt itself. Two agents can receive semantically identical instructions, spend identical tokens, and still produce systematically different quality, because models respond to how often a phrasing appeared in pre-training, not to its meaning. Higher-frequency wordings win (Why do semantically identical prompts produce different LLM outputs?). So 'identical token count' hides a lot: the same count can carry phrasings the model is fluent in or phrasings it stumbles over.
Architecture widens the gap further. A single LLM running structured persona-simulation prompts can replicate what a whole multi-agent debate does, meaning the *shape* of how tokens are organized, not the raw count, drives the outcome (Can branching prompts replicate what multi-agent systems do?). And agents that share reasoning through latent representations (KV caches) rather than re-serializing everything into text reach 14.6% accuracy gains while *cutting* tokens by 70.8-83.7% (Can agents share thoughts without converting them to text?; Can agents share thoughts directly without using language?). That's the inversion of the question: not just different outputs at equal tokens, but better outputs at fewer.
The thing you might not have known you wanted to know: token count is a *cost* unit, not a *quality* unit, and the field is quietly migrating away from it. A 115-day case study of persistent agents found that 82.9% of tokens were cache reads, pushing the meaningful cost denominator from tokens to completed artifacts (Do persistent agents really cost less per token?). Two agents with identical token counts can differ because of trajectory structure, prompt phrasing frequency, prompting architecture, and whether they communicate in text or latent space. That is why counting tokens tells you what an agent *spent*, never quite what it *did*.
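To make the change of denominator concrete, here is a small sketch that prices two agents with the same token count per completed artifact. Everything in it is an assumption for illustration: the function name, the relative prices (a cache read costing one tenth of a fresh token), and the artifact counts. Only the 82.9% cache-read share comes from the case study.

```python
def cost_per_artifact(
    fresh_tokens: int,
    cache_read_tokens: int,
    artifacts: int,
    fresh_price: int = 10,      # hypothetical relative price units per fresh token
    cache_read_price: int = 1,  # hypothetical: cache reads assumed 10x cheaper
) -> float:
    """Total spend divided by completed artifacts, pricing cache reads separately."""
    if artifacts <= 0:
        raise ValueError("artifacts must be positive")
    total_cost = fresh_tokens * fresh_price + cache_read_tokens * cache_read_price
    return total_cost / artifacts


# Two hypothetical agents, each spending 1,000,000 tokens with 82.9% cache reads.
fresh, cached = 171_000, 829_000
print(cost_per_artifact(fresh, cached, artifacts=10))  # 253900.0
print(cost_per_artifact(fresh, cached, artifacts=40))  # 63475.0
```

Both agents report an identical token count, yet under these assumed prices the second delivers each artifact at a quarter of the cost, which a per-token figure cannot show.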
Sources (8 notes)
Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.
Research shows token usage explains 80% of multi-agent performance variance, systems use 15× more tokens than single agents, and coordination yields negative returns above 45% accuracy. Performance gains come from token distribution, not coordination sophistication.
Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
Research shows single LLMs using dynamic persona simulation achieve multi-agent cognitive synergy without multiple model instances. Solo Performance Prompting validates that structured prompting techniques map directly to multi-agent debate architectures, enabling equivalent outcomes through structural equivalence.
LatentMAS enables agents to share internal representations directly via KV caches, reaching 14.6% accuracy gains and 70.8-83.7% token reduction with no additional training. Hidden embeddings preserve reasoning fidelity that text-based systems cannot.
Research formalizes inter-agent thought sharing via sparse autoencoders that recover individual, shared, and private latent thoughts from hidden states. This approach detects alignment conflicts at the representational level before they manifest in language.
A 115-day case study found 82.9% of tokens were cache reads. When context persists and reuses, the meaningful cost denominator becomes completed artifacts, not individual tokens.