Can two agents with identical token counts produce vastly different outputs?

This explores whether token count — the headline number everyone uses to predict agent performance — actually determines what an agent produces, or whether two agents spending the same tokens can diverge wildly in output.

The short answer from the corpus: yes. Identical token counts can produce very different outputs, and the *reasons why* are more interesting than the question first suggests.

Start with the surprising baseline. Anthropic's evals found that token spending alone explains about 80% of the performance variance in multi-agent research systems (see: Does token spending drive multi-agent research performance?; Are multi-agent systems actually intelligent coordination or just token spending?). That makes it tempting to treat tokens as destiny. But 80% is not 100%, and the remaining 20% is where two equal-budget agents can diverge. *How* you spend tokens matters as well as how many you spend: shared-prefix tree rollouts produce more distinct trajectories within a fixed token budget than independent chains, because branching from a common prefix avoids re-paying for redundant work (see: Can shared-prefix trees reduce redundancy in agent rollouts?). Same budget, more genuine exploration, and therefore different and potentially better outputs.
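The budget arithmetic behind shared-prefix rollouts can be sketched in a few lines. This is a simplified illustration, not the method from the cited work: it assumes a single branching point, trajectories of equal length, and hypothetical numbers (`budget`, `traj_len`, and `shared_prefix` are names introduced here).

```python
def chains_within_budget(budget: int, traj_len: int) -> int:
    """Independent chains: every trajectory pays for all of its own tokens."""
    return budget // traj_len


def tree_within_budget(budget: int, traj_len: int, shared_prefix: int) -> int:
    """One-level tree (an assumption of this sketch): the shared prefix is
    generated once, then each branch pays only for its own suffix."""
    if not 0 <= shared_prefix < traj_len:
        raise ValueError("shared_prefix must be in [0, traj_len)")
    if budget < traj_len:
        return 0  # not even one full trajectory fits
    return (budget - shared_prefix) // (traj_len - shared_prefix)


# Hypothetical numbers: 10,000-token budget, 1,000-token trajectories,
# the first 600 tokens of each trajectory shared across branches.
budget, traj_len, shared_prefix = 10_000, 1_000, 600
print(chains_within_budget(budget, traj_len))               # 10
print(tree_within_budget(budget, traj_len, shared_prefix))  # 23
```

Under these assumptions the same budget yields 23 trajectories instead of 10. How distinct those trajectories actually are depends on where the branching happens, which this sketch does not model.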

The most direct evidence sits one level deeper, at the prompt itself. Two agents can receive semantically identical instructions, spend identical tokens, and still produce systematically different quality, because models respond to how often a phrasing appeared in pre-training, not to its meaning. Higher-frequency wordings win (see: Why do semantically identical prompts produce different LLM outputs?). So 'identical token count' hides a lot: the same count can carry phrasings the model is fluent in or phrasings it stumbles over.
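To make the frequency idea concrete, here is a toy proxy. It is not the measure used in the cited work: it assumes a tiny stand-in corpus, whitespace tokenization, and add-one smoothing, and `mean_log_frequency` is a hypothetical helper. It only shows that two prompts of the same length can carry very different frequency mass.

```python
import math
from collections import Counter


def mean_log_frequency(text: str, counts: Counter, total: int) -> float:
    """Average smoothed log-frequency of the words in `text` (toy proxy)."""
    words = text.lower().split()
    if not words:
        raise ValueError("text must contain at least one word")
    vocab = len(counts)
    # Add-one smoothing so unseen words get a small, non-zero probability.
    return sum(
        math.log((counts[w] + 1) / (total + vocab + 1)) for w in words
    ) / len(words)


# Stand-in for pre-training text; a real corpus would be vastly larger.
corpus = (
    "please summarize the report please summarize the article "
    "summarize the text"
).split()
counts = Counter(corpus)
total = len(corpus)

common = "please summarize the report"  # 4 words, all seen in the corpus
rare = "kindly condense the report"     # 4 words, two never seen

print(mean_log_frequency(common, counts, total))
print(mean_log_frequency(rare, counts, total))
```

Both prompts have the same word count, yet the first scores higher on this proxy. Whether a given model actually performs better on the higher-scoring phrasing is an empirical question the sketch cannot answer.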

Architecture widens the gap further. A single LLM running structured persona-simulation prompts can replicate what a whole multi-agent debate does, meaning the *shape* of how tokens are organized, not the raw count, drives the outcome (see: Can branching prompts replicate what multi-agent systems do?). And agents that share reasoning through latent representations (KV caches) rather than re-serializing everything into text report accuracy gains reaching 14.6% while *cutting* tokens by 70.8-83.7% (see: Can agents share thoughts without converting them to text?; Can agents share thoughts directly without using language?). That's the inversion of the question: not just different outputs at equal tokens, but better outputs at fewer.

The thing you might not have known you wanted to know: token count is a *cost* unit, not a *quality* unit, and the field is quietly migrating away from it. A 115-day case study of persistent agents found that 82.9% of tokens were cache reads, pushing the meaningful denominator from tokens to completed artifacts (see: Do persistent agents really cost less per token?). Two agents with identical token counts can differ because of trajectory structure, prompt phrasing frequency, prompting architecture, and whether they communicate in text or latent space. That is why counting tokens tells you what an agent *spent*, never quite what it *did*.
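The shift from tokens to artifacts as the denominator can be illustrated with a small calculation. Everything numeric here is an assumption except the 82.9% cache-read share taken from the case study above: the prices, the token total, and the artifact counts are hypothetical, as are the function names.

```python
def blended_cost(total_tokens: int, cache_read_share: float,
                 fresh_price: float, cache_read_price: float) -> float:
    """Total spend when a share of the tokens is billed at a cache-read rate."""
    if not 0.0 <= cache_read_share <= 1.0:
        raise ValueError("cache_read_share must be between 0 and 1")
    cached = total_tokens * cache_read_share
    fresh = total_tokens - cached
    return fresh * fresh_price + cached * cache_read_price


def cost_per_artifact(total_cost: float, artifacts: int) -> float:
    """Spend divided by finished outputs rather than by tokens."""
    if artifacts <= 0:
        raise ValueError("artifacts must be positive")
    return total_cost / artifacts


# Hypothetical per-token prices; real pricing varies by provider and model.
FRESH_PRICE = 3e-6
CACHE_READ_PRICE = 3e-7

# Two agents with the same token count and the same cache-read share.
spend = blended_cost(1_000_000, 0.829, FRESH_PRICE, CACHE_READ_PRICE)
agent_a = cost_per_artifact(spend, artifacts=4)  # finishes four artifacts
agent_b = cost_per_artifact(spend, artifacts=1)  # finishes one

print(f"spend: {spend:.4f}")
print(f"agent A per artifact: {agent_a:.4f}")
print(f"agent B per artifact: {agent_b:.4f}")
```

Per token, the two agents look identical. Per artifact, agent B costs four times as much, which is the distinction the artifact denominator is meant to surface.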

Sources: 8 notes

Does token spending drive multi-agent research performance?

Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.

Are multi-agent systems actually intelligent coordination or just token spending?

Research shows that token usage explains 80% of multi-agent performance variance, that multi-agent systems use 15× more tokens than single agents, and that coordination yields negative returns above 45% accuracy. Performance gains come from token distribution, not coordination sophistication.

Can shared-prefix trees reduce redundancy in agent rollouts?

Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Can branching prompts replicate what multi-agent systems do?

Research shows single LLMs using dynamic persona simulation achieve multi-agent cognitive synergy without multiple model instances. Solo Performance Prompting validates that structured prompting techniques map directly to multi-agent debate architectures, enabling equivalent outcomes through structural equivalence.

Can agents share thoughts without converting them to text?

LatentMAS enables agents to share internal representations directly via KV caches, reaching 14.6% accuracy gains and 70.8-83.7% token reduction with no additional training. Hidden embeddings preserve reasoning fidelity that text-based systems cannot.

Can agents share thoughts directly without using language?

Research formalizes inter-agent thought sharing via sparse autoencoders that recover individual, shared, and private latent thoughts from hidden states. This approach detects alignment conflicts at the representational level before they manifest in language.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether token count determines agent output diversity. The question remains open: can two agents with identical token budgets produce vastly different results, and if so, what mechanisms drive the divergence?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and include:
• Token spending alone explains ~80% of multi-agent performance variance, leaving ~20% unexplained by budget alone (~2025).
• Shared-prefix tree rollouts generate more distinct trajectories per token than independent chains by avoiding redundant prefix re-computation (~2025).
• Semantically identical prompts produce systematically different outputs because models respond to pre-training frequency, not meaning alone (~2026).
• Single LLMs running structured persona-simulation can replicate multi-agent debate outcomes; token *organization* (not count) drives results (~2026).
• Latent communication (KV caches) in place of text serialization can cut tokens by roughly 71–84% while improving accuracy; persistent agents show ~83% of tokens are cache reads (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2509.21240 (Tree Search for LLM Agent RL, 2025-09)
• arXiv:2510.20733 (Thought Communication in Multiagent Collaboration, 2025-10)
• arXiv:2604.02176 (Adam's Law: Textual Frequency Law on LLMs, 2026-04)
• arXiv:2605.26870 (Persistent AI Agents in Academic Research, 2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 80% token-variance claim: has scaling to frontier models (o1, o3, Gemini 2.0, or equivalents) since tightened or loosened this bound? Do latent-space methods now dominate text-serialized multi-agent work, overturning the token-count framing entirely? Test whether prompt-frequency effects (Adam's Law) persist across instruction-tuned and RL-trained variants.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Prioritize papers showing single-agent systems match or exceed multi-agent performance, or work proving latent collaboration irreducible to token metrics.
(3) Propose 2 research questions that ASSUME the regime has shifted: (a) If latent communication is now standard, what replaces token count as the unit of agent *capability* (not cost)? (b) Can you predict output *diversity* at fixed token budgets from prompt-frequency statistics alone, or does architecture (tree search, persona-stacking, KV reuse) introduce irreducible variance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
