What makes multi-agent teams actually perform better?

Navigation hub examining multi-agent reasoning, scaling laws, coordination failures, and architectural alternatives for agent collaboration.

Topic Hub · 28 linked notes · 8 sections

Multi-Agent Reasoning and Evaluation

7 notes

Can multiple agents stay diverse during training together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.

Can AI systems design unique multi-agent workflows per individual query?

Explores whether meta-agents trained with reinforcement learning can automatically generate personalized multi-agent system architectures tailored to individual user queries, rather than applying fixed task-level templates uniformly.

Does cognitive diversity alone improve multi-agent ideation quality?

This explores whether diverse perspectives in group AI systems automatically produce better ideas, or if something else—like expertise—is equally critical for collaborative ideation to outperform solo agents.

Can tool-using agents evaluate AI outputs more reliably than passive LLM judges?

Does active evidence collection through tool use reduce judge inconsistency compared to passive, reading-only evaluation? This matters for benchmarking AI systems, where evaluation reliability directly affects research validity.

Can personas extracted from documents generalize across evaluation tasks?

This explores whether automating persona creation from domain documents—rather than hand-crafting roles—enables multi-agent evaluators to transfer across different tasks without redesign. The question matters because manual personas fail to generalize across domains.

Can branching prompts replicate what multi-agent systems do?

Explores whether non-linear prompting structures (tree-of-thought, debate prompting) can functionally replace multi-agent architectures, and whether a single LLM simulating multiple personas achieves the same cognitive benefits as multiple models collaborating.

Can tailoring queries per document improve debatable summarization?

When summarizing documents with opposing perspectives on a topic, does adapting the query to each document's unique content retrieve more balanced viewpoints than using a single uniform query?

Debate and Consensus Formation

1 note

Scaling Laws and Token Economics

4 notes

When does adding more agents actually help systems?

Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determines when additional agents improve or degrade performance.

Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

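
One way to make this question concrete is a token-matched comparison: credit the single agent with the same token budget the multi-agent system spent (for instance via repeated sampling) and check whether the gap survives. The sketch below is illustrative only; the numbers, the `RunStats` fields, and the assumption of independent, oracle-selected samples are all invented.

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    """Aggregate results of one system on a benchmark."""
    name: str
    accuracy: float         # fraction of tasks solved
    tokens_per_task: float  # mean tokens spent per task

def token_matched_gap(mas: RunStats, single: RunStats) -> float:
    """Crude confound check: scale the single agent's accuracy as if it
    re-sampled until it matched the MAS token budget, assuming each
    independent sample succeeds with the single-run accuracy and an
    oracle picks a correct one if any exists (an optimistic bound).
    Returns the MAS advantage that remains after token matching."""
    samples = mas.tokens_per_task / single.tokens_per_task
    matched_acc = 1.0 - (1.0 - single.accuracy) ** samples
    return mas.accuracy - matched_acc

# Hypothetical numbers, for illustration only.
mas = RunStats("debate-of-3", accuracy=0.62, tokens_per_task=9000)
solo = RunStats("single-agent", accuracy=0.50, tokens_per_task=3000)
print(f"raw gap: {mas.accuracy - solo.accuracy:+.2f}")
print(f"token-matched gap: {token_matched_gap(mas, solo):+.2f}")
```

With these made-up numbers the raw gap is positive but the token-matched gap is negative, which is exactly the confound the note asks about.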
Can small language models handle most agent tasks?

Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.

When do multi-agent systems actually outperform single agents?

As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes multi-agent systems counterproductive.

Coordination Failures and Socialization

3 notes

Why do multi-agent systems fail to coordinate at scale?

Explores how LLM agents struggle to synchronize strategy timing and validate information when coordinating across larger networks, revealing fundamental limits in distributed reasoning.

Why do multi-agent LLM systems fail more than expected?

This research asks what specific failure modes cause multi-agent systems to underperform despite their promise. Understanding these failure patterns is essential for building more reliable collaborative AI systems.

Why don't AI agents develop social structure at scale?

When millions of LLM agents interact continuously on a social platform, do they form collective norms and influence hierarchies like human societies? This tests whether scale and interaction density alone drive socialization.

Model Composition and Latent Communication

2 notes

Can language models discover new expertise through collaborative weight search?

Can model experts be composed through particle swarm optimization in weight space without training? This explores whether collaborative search can discover capabilities that no individual expert possesses.

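
The mechanism named here can be sketched in miniature: treat each particle as a candidate weight vector seeded near the experts, and let standard particle swarm updates search the space between and beyond them, with no gradients and no training. The two "experts", the quadratic fitness, and all hyperparameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "expert" weight vectors; neither is optimal on its own.
expert_a = np.array([1.0, 0.0, 0.5])
expert_b = np.array([0.0, 1.0, 0.5])
target   = np.array([0.6, 0.6, 0.9])  # stand-in for the "true" best weights

def fitness(w: np.ndarray) -> float:
    """Toy task score: higher is better, peaks at `target`."""
    return -float(np.sum((w - target) ** 2))

# --- particle swarm over weight space ---
n_particles, steps = 16, 100
pos = np.array([expert_a, expert_b] * (n_particles // 2))
pos = pos + 0.1 * rng.standard_normal(pos.shape)  # jitter around the experts
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmax()].copy()

inertia, c1, c2 = 0.7, 1.5, 1.5
for _ in range(steps):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    f = np.array([fitness(p) for p in pos])
    improved = f > pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmax()].copy()

print("expert A fitness:", round(fitness(expert_a), 3))
print("expert B fitness:", round(fitness(expert_b), 3))
print("swarm-composed fitness:", round(fitness(gbest), 3))
```

The swarm's best composite scores higher than either expert alone, which is the "capability no individual expert possesses" claim in toy form.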
Can agents share thoughts without converting them to text?

Can multi-agent systems exchange information through continuous hidden representations instead of language? This matters because text serialization loses information and slows inference.

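
A toy demonstration of the serialization-loss argument: agent A hands its hidden vector to agent B either directly, or round-tripped through a short decimal string as a crude stand-in for decoding into tokens and re-reading them. The "agents" here are random linear maps, and every dimension and precision choice is made up.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model = 64
W_a = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)  # "agent A"
W_b = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)  # "agent B"

x = rng.standard_normal(d_model)
hidden = np.tanh(W_a @ x)  # agent A's internal representation

# Channel 1: pass the hidden state directly (latent communication).
latent_out = np.tanh(W_b @ hidden)

# Channel 2: serialize to text at 2 decimal places, then parse it back
# (a stand-in for decoding thoughts into language and re-encoding them).
text = " ".join(f"{v:.2f}" for v in hidden)
reparsed = np.array([float(t) for t in text.split()])
text_out = np.tanh(W_b @ reparsed)

err = float(np.linalg.norm(latent_out - text_out))
print(f"divergence caused by the text round-trip: {err:.4f}")
```

The direct channel is exact by construction; the text channel quantizes every coordinate, and the downstream agent's output drifts accordingly. Real LLM hidden states are far higher-dimensional, so the effect compounds.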
Specialized Multi-Agent Applications

1 note

Cooperation and Delegation

3 notes

Can agents learn cooperation by adapting to diverse partners?

Explores whether sequence model agents can develop mutual cooperation strategies through in-context learning when trained against varied co-players, without explicit cooperation mechanisms or hardcoded assumptions.

What makes delegation work beyond just splitting tasks?

Delegation is more than task decomposition. What dimensions of a task—like verifiability, reversibility, and subjectivity—determine whether an agent can safely and effectively handle it?

Why do protocol-based tool systems fail in production agentic workflows?

Explores whether standardized tool protocols like MCP introduce non-determinism that undermines reliable agent execution, and what causes ambiguous tool selection in production systems.
