Can latent communication reduce the token cost of multi-agent systems?
This explores whether letting agents exchange internal representations (their 'latent thoughts') instead of writing and re-reading text messages can cut the token bills that dominate multi-agent system costs.
The corpus says yes, and unusually directly. The most concrete result is LatentMAS, where agents pass internal representations through shared KV caches rather than writing messages to each other, achieving 70.8–83.7% token reduction with up to a 14.6% accuracy gain and no additional training (Can agents share thoughts without converting them to text?). It pays off twice, cheaper and better, because text serialization is lossy: forcing reasoning through natural language discards fidelity that hidden embeddings preserve. A related line of work formalizes the same intuition, using sparse autoencoders to extract individual, shared, and private latent thoughts from hidden states, which also lets agents detect alignment conflicts at the representational level before they surface in language (Can agents share thoughts directly without using language?).
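To get a feel for the scale involved, here is a minimal sketch of token accounting for a text-relay chain of agents, followed by what a reduction in the reported 70.8–83.7% range would leave of that budget. The relay model (each agent re-reads the prompt plus every earlier message, then writes one message) and the agent count and token sizes are illustrative assumptions, not figures from LatentMAS, and applying the reported range to this toy baseline is only a back-of-the-envelope exercise.

```python
def text_relay_tokens(n_agents: int, prompt_tokens: int, message_tokens: int) -> int:
    """Tokens processed in a toy text-relay chain.

    Assumption: agent i re-reads the prompt plus all i earlier messages
    as text, then writes one message of its own.
    """
    total = 0
    for i in range(n_agents):
        read = prompt_tokens + i * message_tokens  # context re-read as text
        total += read + message_tokens             # plus the newly written message
    return total


def apply_reduction(tokens: int, reduction_pct: float) -> int:
    """Tokens remaining after a percentage reduction, rounded to whole tokens."""
    return round(tokens * (1 - reduction_pct / 100))


# Hypothetical workload: 4 agents, 500-token prompt, 300-token messages.
baseline = text_relay_tokens(n_agents=4, prompt_tokens=500, message_tokens=300)
low = apply_reduction(baseline, 70.8)   # lower end of the reported range
high = apply_reduction(baseline, 83.7)  # upper end of the reported range
print(baseline, low, high)
```

In this toy model the re-read term grows with every agent added to the chain, which is the part of the bill that sharing state instead of text is meant to avoid.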
Why this matters so much becomes clear once you see what actually drives multi-agent performance. Anthropic's internal evals found that roughly 80% of the performance variance in multi-agent research systems comes from token spending, not coordination cleverness (Does token spending drive multi-agent research performance?), a finding echoed in the broader framing of agent-level test-time scaling as 'primarily a token spending function' (How does test-time scaling work at the agent level?). If performance is bought with tokens, then anything that decouples capability from token volume changes the economics directly rather than at the margins. That note names LatentMAS and shared-KV-cache approaches explicitly as ways to do so.
Latent exchange isn't the only lever, and reading the alternatives together is where it gets interesting. One approach attacks the cost by changing the message format without leaving language: MetaGPT shows that having agents produce standardized engineering artifacts and pull from a shared environment beats free-form conversational chatter and eliminates noise (Does structured artifact sharing outperform conversational coordination?). Another attacks the substrate: most agentic subtasks are repetitive and well-defined enough that small language models handle them at 10–30× lower cost, making latent efficiency and model right-sizing complementary rather than competing savings (Can small language models handle most agent tasks?). A third attacks memory: DeepAgent's autonomous memory folding compresses interaction history into structured schemas, cutting token overhead while preserving the details that matter (Can agents compress their own memory without losing critical details?).
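The "complementary rather than competing" point can be sketched as simple arithmetic: one lever shrinks the number of tokens, the other shrinks the price paid per token on the share of work routed to a small model. The function below is a hypothetical cost model, not something taken from the cited notes. It assumes the two savings are independent and multiply, and the 70% routing share is an invented example value.

```python
def relative_cost(token_reduction_pct: float, slm_share: float, slm_cost_ratio: float) -> float:
    """Cost relative to an all-LLM, all-text baseline (baseline = 1.0).

    token_reduction_pct: percent of tokens removed (e.g. by latent exchange).
    slm_share: fraction of the remaining work routed to a small model (0..1).
    slm_cost_ratio: how many times cheaper the small model is (e.g. 10 for 10x).
    Assumption: the two savings are independent and therefore multiply.
    """
    if not 0 <= slm_share <= 1:
        raise ValueError("slm_share must be between 0 and 1")
    if slm_cost_ratio <= 0:
        raise ValueError("slm_cost_ratio must be positive")
    tokens_left = 1 - token_reduction_pct / 100
    price_mix = slm_share / slm_cost_ratio + (1 - slm_share)
    return tokens_left * price_mix


latent_only = relative_cost(70.8, slm_share=0.0, slm_cost_ratio=10)
slm_only = relative_cost(0.0, slm_share=0.7, slm_cost_ratio=10)
combined = relative_cost(70.8, slm_share=0.7, slm_cost_ratio=10)
print(latent_only, slm_only, combined)
```

Under these assumptions the combined cost is lower than either lever alone, which is what "complementary" means in practice. Whether the independence assumption holds for a given system is an open question.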
There's a deeper reframing worth noticing. A 115-day case study found 82.9% of tokens were cache reads, and argued the meaningful cost denominator stops being the individual token and becomes the completed artifact (Do persistent agents really cost less per token?). Latent communication and persistent caching are two routes to the same destination: stop paying to re-serialize and re-read what the system already knows.
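A small sketch of what switching the denominator looks like, assuming cache reads are billed at a discount relative to fresh tokens. The 82.9% cache-read share comes from the case study above. The discount factor, token total, unit price, and artifact count are hypothetical placeholders, since the note does not give them.

```python
def cost_per_artifact(
    total_tokens: int,
    cache_read_share: float,
    full_price: float,
    cache_read_discount: float,
    artifacts: int,
) -> float:
    """Average cost of one completed artifact.

    cache_read_share: fraction of tokens served as cache reads (0..1).
    full_price: price of one freshly processed token (arbitrary units).
    cache_read_discount: cache-read price as a fraction of full_price
        (hypothetical; depends on the provider).
    """
    if artifacts <= 0:
        raise ValueError("artifacts must be positive")
    cached = total_tokens * cache_read_share
    fresh = total_tokens - cached
    total_cost = fresh * full_price + cached * full_price * cache_read_discount
    return total_cost / artifacts


# Hypothetical run: 1M tokens, 20 artifacts, cache reads at 10% of full price.
with_cache = cost_per_artifact(1_000_000, 0.829, 1.0, 0.1, 20)
# Same run if every token were billed at full price.
without_discount = cost_per_artifact(1_000_000, 0.829, 1.0, 1.0, 20)
print(with_cache, without_discount)
```

The per-token price is the same in both calls. Only the mix of cached and fresh tokens changes the per-artifact figure, which is why the artifact is the more informative unit when context persists.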
One caution the corpus raises against pure efficiency-chasing: cheaper communication doesn't fix coordination. Multi-agent systems degrade predictably with scale because agents agree too late or adopt strategies without informing neighbors, and they tend to accept neighbor information without verification, which lets errors propagate (Why do multi-agent systems fail to coordinate at scale?). Consensus tends to fail through stalled convergence (liveness loss) rather than corrupted values (Can LLM agent groups reliably reach consensus together?). The intriguing implication is that latent communication might help on both fronts at once: the same hidden-state sharing that saves tokens also exposes alignment conflicts at the representational level, potentially turning a cost optimization into a coordination repair.
Sources (10 notes)
LatentMAS enables agents to share internal representations directly via KV caches, reaching 14.6% accuracy gains and 70.8-83.7% token reduction with no additional training. Hidden embeddings preserve reasoning fidelity that text-based systems cannot.
Research formalizes inter-agent thought sharing via sparse autoencoders that recover individual, shared, and private latent thoughts from hidden states. This approach detects alignment conflicts at the representational level before they manifest in language.
Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
A 115-day case study found 82.9% of tokens were cache reads. When context persists and reuses, the meaningful cost denominator becomes completed artifacts, not individual tokens.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.