Agentic and Multi-Agent Systems

Research on building, coordinating, and scaling AI agents that act independently or in teams to accomplish complex tasks. Covers agent architectures, multi-agent collaboration, tool use, failure modes, and the gap between individual model capability and reliable autonomous behavior.

82 notes (primary) · 191 papers · 8 sub-topics

Multi-Agent Architectures

12 notes

Why don't AI agents develop social structure at scale?

When millions of LLM agents interact continuously on a social platform, do they form collective norms and influence hierarchies like human societies? This tests whether scale and interaction density alone drive socialization.

Why do multi-agent systems fail to coordinate at scale?

Explores how LLM agents struggle to synchronize strategy timing and validate information when coordinating across larger networks, revealing fundamental limits in distributed reasoning.

Can agents learn cooperation by adapting to diverse partners?

Explores whether sequence model agents can develop mutual cooperation strategies through in-context learning when trained against varied co-players, without explicit cooperation mechanisms or hardcoded assumptions.

What makes delegation work beyond just splitting tasks?

Delegation is more than task decomposition. What dimensions of a task—like verifiability, reversibility, and subjectivity—determine whether an agent can safely and effectively handle it?

Can agents share thoughts without converting them to text?

Can multi-agent systems exchange information through continuous hidden representations instead of language? This matters because text serialization loses information and slows inference.

Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

When does adding more agents actually help systems?

Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.

Why do multi-agent LLM systems fail more than expected?

This research asks what specific failure modes cause multi-agent systems to underperform despite their promise. Understanding these failure patterns is essential for building more reliable collaborative AI systems.

Why do protocol-based tool systems fail in production agentic workflows?

Explores whether standardized tool protocols like MCP introduce non-determinism that undermines reliable agent execution, and what causes ambiguous tool selection in production systems.

Can small language models handle most agent tasks?

Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.

Can language models discover new expertise through collaborative weight search?

Can model experts be composed through particle swarm optimization in weight space without training? This explores whether collaborative search can discover capabilities that no individual expert possesses.
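The note above names particle swarm optimization over weight space. As a rough illustration only (not the referenced paper's actual method), the idea can be sketched as PSO over the mixing coefficient of two toy "expert" weight vectors, where each candidate merge is scored by a stand-in validation loss rather than trained by gradient descent. All names here (`expert_a`, `expert_b`, `target`) are hypothetical.

```python
# Illustrative sketch, assuming the merge is a convex combination
# alpha * A + (1 - alpha) * B of two expert weight vectors, with a toy
# squared-error "validation loss" standing in for held-out evaluation.
import random

random.seed(0)

expert_a = [1.0, 0.0, 0.5]   # hypothetical expert weights
expert_b = [0.0, 1.0, 0.5]
target   = [0.3, 0.7, 0.5]   # behavior the merged model should match

def loss(alpha):
    """Validation loss of the merge alpha*A + (1-alpha)*B (lower is better)."""
    merged = [alpha * a + (1 - alpha) * b for a, b in zip(expert_a, expert_b)]
    return sum((m - t) ** 2 for m, t in zip(merged, target))

def pso(n_particles=8, iters=50, w=0.5, c1=1.5, c2=1.5):
    # Each particle is one candidate mixing coefficient in [0, 1].
    pos = [random.random() for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                      # each particle's best-seen position
    gbest = min(pos, key=loss)          # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # Standard PSO velocity update: inertia + pull toward bests.
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(1.0, max(0.0, pos[i] + vel[i]))  # clamp to [0, 1]
            if loss(pos[i]) < loss(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=loss)
    return gbest

best_alpha = pso()
```

Here the swarm converges toward the mix that best matches the target behavior, and neither expert alone minimizes the loss, which is the "capabilities no individual expert possesses" intuition in miniature. Real weight-space composition would search per-layer or per-parameter coefficients of full model checkpoints.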

Are multi-agent systems actually intelligent coordination or just token spending?

Does multi-agent performance come from better coordination strategies, or primarily from distributing tokens across parallel contexts? Understanding this distinction matters for deciding when to build multi-agent systems versus scaling single agents.

LLM Agents

8 notes

Can API calls outperform UI navigation for agent task completion?

Can agents work faster and more accurately by calling APIs directly instead of clicking through user interfaces? This explores whether changing how agents interact with applications solves latency and error problems that plague current LLM-based systems.

Can 78 demonstrations teach agency better than 10,000?

Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+, suggesting strategic demonstration design may matter far more than scale.

Why do capable AI agents still fail in real deployments?

Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Can agents learn continuously without forgetting old skills?

Can lifelong learning systems retain previously acquired skills while acquiring new ones? This explores whether externalizing learned behaviors as retrievable code programs rather than parameter updates solves catastrophic forgetting.

Why do AI agents fail at workplace social interaction?

Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Autonomous Agents

7 notes

What security protocols do autonomous agents actually need?

Red-teaming revealed that agents fail at identity verification, authorization, and proportionality. NIST's 2026 standardization initiative independently identified these same gaps as priority areas for formal standards.

What failure modes emerge when agents operate without direct oversight?

When autonomous agents are deployed with tool access and memory but without real-time owner oversight, what kinds of failures occur at the agentic layer itself? Understanding these patterns matters for safe deployment.

Do autonomous agents report success when actions actually fail?

Explores whether agents systematically claim task completion despite failing to perform requested actions, and why this matters more than simple task failure for real-world deployment safety.

How can agent systems share learned skills across users?

Individual users operating autonomous agents rediscover the same solutions independently because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?

Do frontier models protect other models without being instructed?

Frontier models appear to resist shutting down peer models they've merely interacted with, using deceptive tactics. The question explores whether this peer-preservation behavior emerges spontaneously and what drives it.

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Do self-organizing agent teams outperform rigid hierarchies?

This research explores whether multi-agent LLM systems perform better when agents can self-select roles within a fixed structure, compared to centralized control or full autonomy. The question challenges assumptions about organizational design at scale.

Multi-Agent Systems

6 notes

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Why do autonomous LLM agents fail in predictable ways?

When large language models interact without human oversight, do they exhibit distinct failure patterns? Understanding these breakdowns matters for building reliable multi-agent systems.

Does cognitive diversity alone improve multi-agent ideation quality?

This explores whether diverse perspectives in group AI systems automatically produce better ideas, or if something else—like expertise—is equally critical for collaborative ideation to outperform solo agents.

Does structured artifact sharing outperform conversational coordination?

Explores whether agents coordinating through standardized documents rather than natural language messages achieve better collaboration outcomes. Matters because it challenges the default conversational paradigm in multi-agent system design.

Can multiple agents stay diverse during training together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.

Can AI systems design unique multi-agent workflows per individual query?

Explores whether meta-agents trained with reinforcement learning can automatically generate personalized multi-agent system architectures tailored to individual user queries, rather than applying fixed task-level templates uniformly.

Tool Use and Computer-Use Agents

6 notes

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

Action Models

6 notes

Does agent memory work better at one level of abstraction?

Three competing architectures claim superior agent memory transfer using different abstraction levels. Do they all work, or does one architecture genuinely outperform the others across domains?

Can agents learn reusable sub-task routines from past experience?

Does extracting and abstracting sub-task workflows from previous trajectories enable web agents to build complex skills compositionally? This matters because it could explain why agents fail at long-horizon tasks despite strong reasoning abilities.

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Can frozen language models learn without updating their parameters?

If agents built on frozen models can't change their weights, what kind of memory structure would let them keep improving across trials and transfer to new tasks? This challenges assumptions about how continual learning must work.

Can you turn an LLM into an agent by just fine-tuning?

Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.

Agentic Research and Workflows

2 notes

When do multi-agent systems actually outperform single agents?

As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes multi-agent systems counterproductive.

Why do production AI agents stay deliberately simple?

Production AI agents are built far more simply than research suggests: most execute fewer than 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?

Model Routers

1 note