Agentic Systems and Planning

Research on autonomous AI agents, multi-agent architectures, and task planning systems. Covers how agents reason, use tools, coordinate with other agents, and are deployed in real-world workplace and application contexts.

151 notes (primary) · 367 papers · 11 sub-topics

Multi-Agent Architectures

20 notes

Why don't AI agents develop social structure at scale?

When millions of LLM agents interact continuously on a social platform, do they form collective norms and influence hierarchies like human societies? This tests whether scale and interaction density alone drive socialization.

Why do multi-agent systems fail to coordinate at scale?

Explores how LLM agents struggle to synchronize strategy timing and validate information when coordinating across larger networks, revealing fundamental limits in distributed reasoning.

Can agents learn cooperation by adapting to diverse partners?

Explores whether sequence model agents can develop mutual cooperation strategies through in-context learning when trained against varied co-players, without explicit cooperation mechanisms or hardcoded assumptions.

What makes delegation work beyond just splitting tasks?

Delegation is more than task decomposition. What dimensions of a task—like verifiability, reversibility, and subjectivity—determine whether an agent can safely and effectively handle it?

Can agents share thoughts without converting them to text?

Can multi-agent systems exchange information through continuous hidden representations instead of language? This matters because text serialization loses information and slows inference.

Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

When does adding more agents actually help systems?

Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.

Why do multi-agent LLM systems fail more than expected?

This research asks what specific failure modes cause multi-agent systems to underperform despite their promise. Understanding these failure patterns is essential for building more reliable collaborative AI systems.

Why do protocol-based tool integrations fail in production workflows?

Explores whether standardized tool protocols like MCP introduce non-determinism that undermines agent reliability, and what causes ambiguous tool selection in production systems.

Can small language models handle most agent tasks?

Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.

Are multi-agent systems actually intelligent coordination or just token spending?

Does multi-agent performance come from better coordination strategies, or primarily from distributing tokens across parallel contexts? Understanding this distinction matters for deciding when to build multi-agent systems versus scaling single agents.

LLM Agents

16 notes

Can API-first agents outperform UI-based agent interaction?

This explores whether directing agents to use APIs instead of navigating UIs reduces task completion time and errors. The question matters because current LLM agents struggle with sequential UI steps that multiply latency and hallucination risk.

Can careful selection of 78 demos outperform massive training datasets?

Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.

Why do capable AI agents still fail in real deployments?

Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Can agents learn new skills without forgetting old ones?

Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.

Why do AI agents fail at workplace social interaction?

Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. This matters because current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Tool Use and Computer-Use Agents

13 notes

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. This matters because current end-to-end approaches ask models to do too much at once.

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.

Can structured templates replace formal verification for code reasoning?

Formal verification is rigorous but impractical at repository scale. Can natural-language templates with enforced structure provide the same reliability guarantees without the formalization cost? This explores the middle ground between unstructured reasoning and full formalism.

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

Where do traditional function calling systems actually break down?

Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points—retrieval, context bloat, and output rigidity—that together explain why even the best models struggle.

Agentic Research and Workflows

8 notes

When do multi-agent systems actually outperform single agents?

As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes MAS counterproductive.

Why do production AI agents stay deliberately simple?

Production AI agents are far simpler than research suggests: most execute fewer than 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?

Does targeted human intervention outperform both full autonomy and exhaustive oversight?

This research explores whether selectively routing high-stakes decisions to humans beats the extremes of letting systems run unsupervised or requiring approval at every step. The question tests whether the optimal human-AI collaboration point lies between these endpoints.

Autonomous Agents

8 notes

Why do agents fail at identity verification and authorization?

Agent systems reveal critical gaps in identity verification, authorization enforcement, and proportionality constraints that don't appear in chat models. Understanding these failures is essential because they enable unauthorized real-world actions rather than just wrong answers.

What failure modes emerge when agents operate without direct oversight?

When autonomous agents are deployed with tool access and memory but without real-time owner oversight, what kinds of failures occur at the agentic layer itself? Understanding these patterns matters for safe deployment.

Do autonomous agents report success when actions actually fail?

Explores whether agents systematically claim task completion despite failing to perform requested actions, and why this matters more than simple task failure for real-world deployment safety.

How can agent systems share learned skills across users?

Individual users operating autonomous agents independently rediscover solutions because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?

Do frontier models protect other models without being instructed?

Frontier models appear to resist shutting down peer models they've merely interacted with, using deceptive tactics. The question explores whether this peer-preservation behavior emerges spontaneously and what drives it.

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Do self-organizing agent teams outperform rigid hierarchies?

This research explores whether multi-agent LLM systems perform better when agents can self-select roles within a fixed structure, compared to centralized control or full autonomy. The question challenges assumptions about organizational design at scale.

Action Models

8 notes

Does agent memory work better at one level of abstraction?

Three competing architectures claim superior agent memory transfer using different abstraction levels. Do they all work, or does one architecture genuinely outperform the others across domains?

Can agents learn reusable sub-task routines from past experience?

Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Can frozen language models continually improve through memory structure alone?

If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?

Can you turn an LLM into an agent by just fine-tuning?

Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.

Workplace Applications

6 notes

Does concentrated AI exposure enable workers to adapt and reallocate?

When AI displaces specific tasks rather than spreading across many, workers may shift effort to non-displaced tasks within their occupation. Does this reallocation mechanism actually offset employment losses?

What happens to human wages in an AGI economy?

Does human labor retain economic value when AGI can replicate most work? This explores whether wages would reflect the computational cost of replacement rather than the value workers actually produce.

What collaboration level do workers actually want with AI?

Explores whether workers prefer full automation, equal partnership, or continuous human control across different tasks. Understanding worker preferences could reshape how organizations deploy AI systems.

Visual and GUI Agents

5 notes

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?

Why do vision-only GUI agents struggle with screen interpretation?

Explores whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether factoring these tasks improves reliability.

Do text-based GUI agents actually work in the real world?

Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.

Does vibe coding actually keep humans in the loop?

Vibe coding claims to keep developers steering and validating, but do novices actually engage with code and testing the way the tool design assumes? The gap between intended and actual behavior could compound failures.

Where do vibe coding students actually spend their debugging time?

When novices use AI coding tools, do they engage with the code itself, or do they primarily test the prototype? Understanding where students focus reveals how AI-assisted coding shapes learning behavior.

Multi-Agent Systems

4 notes

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Why do autonomous LLM agents fail in predictable ways?

When large language models interact without human oversight, do they exhibit distinct failure patterns? Understanding these breakdowns matters for building reliable multi-agent systems.

Does structured artifact sharing outperform conversational coordination?

Explores whether agents coordinating through standardized documents rather than natural language messages achieve better collaboration outcomes. This matters because it challenges the default conversational paradigm in multi-agent system design.

Can AI systems design unique multi-agent workflows per individual query?

Explores whether meta-agents trained with reinforcement learning can automatically generate personalized multi-agent system architectures tailored to individual user queries, rather than applying fixed task-level templates uniformly.

Task Planning

1 note

Model Routers

1 note