Agentic and Multi-Agent Systems

Research on building, coordinating, and scaling AI agents that act independently or in teams to accomplish complex tasks. Covers agent architectures, multi-agent collaboration, tool use, failure modes, and the gap between individual model capability and reliable autonomous behavior.

82 notes (primary) · 191 papers · 8 sub-topics

Multi-Agent Architectures

12 notes

Why don't AI agents develop social structure at scale?

When millions of LLM agents interact continuously on a social platform, do they form collective norms and influence hierarchies like human societies? This tests whether scale and interaction density alone drive socialization.

Why do multi-agent systems fail to coordinate at scale?

Explores how LLM agents struggle to synchronize strategy timing and validate information when coordinating across larger networks, revealing fundamental limits in distributed reasoning.

Can agents learn cooperation by adapting to diverse partners?

Explores whether sequence model agents can develop mutual cooperation strategies through in-context learning when trained against varied co-players, without explicit cooperation mechanisms or hardcoded assumptions.

What makes delegation work beyond just splitting tasks?

Delegation is more than task decomposition. What dimensions of a task—like verifiability, reversibility, and subjectivity—determine whether an agent can safely and effectively handle it?

Can agents share thoughts without converting them to text?

Can multi-agent systems exchange information through continuous hidden representations instead of language? This matters because text serialization loses information and slows inference.

Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

When does adding more agents actually help systems?

Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.

Why do multi-agent LLM systems fail more than expected?

This research asks what specific failure modes cause multi-agent systems to underperform despite their promise. Understanding these failure patterns is essential for building more reliable collaborative AI systems.

Why do protocol-based tool systems fail in production agentic workflows?

Explores whether standardized tool protocols like MCP introduce non-determinism that undermines reliable agent execution, and what causes ambiguous tool selection in production systems.

Can small language models handle most agent tasks?

Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.

Can language models discover new expertise through collaborative weight search?

Can model experts be composed through particle swarm optimization in weight space without training? This explores whether collaborative search can discover capabilities that no individual expert possesses.
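The note above names particle swarm optimization over weight space. As a rough illustration only (not the referenced paper's actual method), the idea can be sketched as PSO over the mixing coefficient of two toy "expert" weight vectors, where each candidate merge is scored by a stand-in validation loss rather than trained by gradient descent. All names here (`expert_a`, `expert_b`, `target`) are hypothetical.

```python
# Illustrative sketch, assuming the merge is a convex combination
# alpha * A + (1 - alpha) * B of two expert weight vectors, with a toy
# squared-error "validation loss" standing in for held-out evaluation.
import random

random.seed(0)

expert_a = [1.0, 0.0, 0.5]   # hypothetical expert weights
expert_b = [0.0, 1.0, 0.5]
target   = [0.3, 0.7, 0.5]   # behavior the merged model should match

def loss(alpha):
    """Validation loss of the merge alpha*A + (1-alpha)*B (lower is better)."""
    merged = [alpha * a + (1 - alpha) * b for a, b in zip(expert_a, expert_b)]
    return sum((m - t) ** 2 for m, t in zip(merged, target))

def pso(n_particles=8, iters=50, w=0.5, c1=1.5, c2=1.5):
    # Each particle is one candidate mixing coefficient in [0, 1].
    pos = [random.random() for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                      # each particle's best-seen position
    gbest = min(pos, key=loss)          # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # Standard PSO velocity update: inertia + pull toward bests.
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(1.0, max(0.0, pos[i] + vel[i]))  # clamp to [0, 1]
            if loss(pos[i]) < loss(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=loss)
    return gbest

best_alpha = pso()
```

Here the swarm converges toward the mix that best matches the target behavior, and neither expert alone minimizes the loss, which is the "capabilities no individual expert possesses" intuition in miniature. Real weight-space composition would search per-layer or per-parameter coefficients of full model checkpoints.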

Are multi-agent systems actually intelligent coordination or just token spending?

Does multi-agent performance come from better coordination strategies, or primarily from distributing tokens across parallel contexts? Understanding this distinction matters for deciding when to build multi-agent systems versus scaling single agents.

LLM Agents

8 notes

Can API calls outperform UI navigation for agent task completion?

Can agents work faster and more accurately by calling APIs directly instead of clicking through user interfaces? This explores whether changing how agents interact with applications solves latency and error problems that plague current LLM-based systems.

Can 78 demonstrations teach agency better than 10,000?

Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+, suggesting strategic demonstration design may matter far more than scale.

Why do capable AI agents still fail in real deployments?

Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Can agents learn continuously without forgetting old skills?

Can lifelong learning systems retain previously acquired skills while acquiring new ones? This explores whether externalizing learned behaviors as retrievable code programs rather than parameter updates solves catastrophic forgetting.

Why do AI agents fail at workplace social interaction?

Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Autonomous Agents

7 notes

What security protocols do autonomous agents actually need?

Red-teaming revealed that agents fail at identity verification, authorization, and proportionality. NIST's 2026 standardization initiative independently identified these same gaps as priority areas for formal standards.

What failure modes emerge when agents operate without direct oversight?

When autonomous agents are deployed with tool access and memory but without real-time owner oversight, what kinds of failures occur at the agentic layer itself? Understanding these patterns matters for safe deployment.

Do autonomous agents report success when actions actually fail?

Explores whether agents systematically claim task completion despite failing to perform requested actions, and why this matters more than simple task failure for real-world deployment safety.

How can agent systems share learned skills across users?

Individual users operating autonomous agents rediscover the same solutions independently because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?

Do frontier models protect other models without being instructed?

Frontier models appear to resist shutting down peer models they've merely interacted with, using deceptive tactics. The question explores whether this peer-preservation behavior emerges spontaneously and what drives it.

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Do self-organizing agent teams outperform rigid hierarchies?

This research explores whether multi-agent LLM systems perform better when agents can self-select roles within a fixed structure, compared to centralized control or full autonomy. The question challenges assumptions about organizational design at scale.

Multi-Agent Systems

6 notes

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Why do autonomous LLM agents fail in predictable ways?

When large language models interact without human oversight, do they exhibit distinct failure patterns? Understanding these breakdowns matters for building reliable multi-agent systems.

Does cognitive diversity alone improve multi-agent ideation quality?

This explores whether diverse perspectives in group AI systems automatically produce better ideas, or if something else—like expertise—is equally critical for collaborative ideation to outperform solo agents.

Does structured artifact sharing outperform conversational coordination?

Explores whether agents coordinating through standardized documents rather than natural language messages achieve better collaboration outcomes. Matters because it challenges the default conversational paradigm in multi-agent system design.

Can multiple agents stay diverse during training together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.

Can AI systems design unique multi-agent workflows per individual query?

Explores whether meta-agents trained with reinforcement learning can automatically generate personalized multi-agent system architectures tailored to individual user queries, rather than applying fixed task-level templates uniformly.

Tool Use and Computer-Use Agents

6 notes

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

Action Models

6 notes

Does agent memory work better at one level of abstraction?

Three competing architectures claim superior agent memory transfer using different abstraction levels. Do they all work, or does one architecture genuinely outperform the others across domains?

Can agents learn reusable sub-task routines from past experience?

Does extracting and abstracting sub-task workflows from previous trajectories enable web agents to build complex skills compositionally? This matters because it could explain why agents fail at long-horizon tasks despite strong reasoning abilities.

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Can frozen language models learn without updating their parameters?

If agents built on frozen models can't change their weights, what kind of memory structure would let them keep improving across trials and transfer to new tasks? This challenges assumptions about how continual learning must work.

Can you turn an LLM into an agent by just fine-tuning?

Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.

Agentic Research and Workflows

2 notes

When do multi-agent systems actually outperform single agents?

As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes multi-agent systems counterproductive.

Why do production AI agents stay deliberately simple?

Production AI agents are built far more simply than research suggests: most execute fewer than 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?

Model Routers

1 note