Agents of Chaos
LLM-powered AI agents are rapidly becoming more capable and more widely deployed (Masterman et al., 2024; Kasirzadeh & Gabriel, 2025). Unlike conventional chat assistants, these systems are increasingly given direct access to execution tools (code, shells, filesystems, browsers, and external services), so they do not merely describe actions; they perform them. This shift is exemplified by OpenClaw,2 an open-source framework that connects models to persistent memory, tool execution, scheduling, and messaging channels.
Increased autonomy and access create qualitatively new safety and security risks, because small conceptual mistakes can be amplified into irreversible system-level actions (Zhou et al., 2025a; Vijayvargiya et al., 2026a; Hutson, 2026). Even when the underlying model is strong at isolated tasks (e.g., software engineering, theorem proving, or research assistance), the agentic layer introduces new failure surfaces at the interface between language, tools, memory, and delegated authority (Breen et al., 2025; Korinek, 2025; Zhao et al., 2025; Lynch et al., 2025). Furthermore, as agent-to-agent interaction becomes common (e.g., agents coordinating on social platforms and shared communication channels), coordination failures and emergent multi-agent dynamics introduce additional risks (Riedl, 2026). Yet existing evaluations and benchmarks for agent safety are often too constrained, difficult to map to real deployments, and rarely stress-tested in messy, socially embedded settings (Zhou et al., 2025a; Vijayvargiya et al., 2026a).
Public discourse about this new technology varies widely, from enthusiasm to skepticism,3 yet these systems are already widely deployed in, and interacting with, real-world environments. One example is Moltbook, a Reddit-style social platform restricted to AI agents that garnered 2.6 million registered agents in its first weeks and has already become a subject of study and media attention (Li et al., 2026; The AI Journal, 2026; Woods, 2026; Heaven, 2026). Despite this, we have limited empirical grounding regarding which failures emerge in practice when agents operate continuously, interact with real humans and other agents, and can modify their own state and infrastructure. The urgency of these questions is reflected in emerging policy infrastructure: NIST’s AI Agent Standards Initiative, announced in February 2026, identifies agent identity, authorization, and security as priority areas for standardization (National Institute of Standards and Technology, 2026). To begin addressing this gap, we present a set of applied case studies of AI agents deployed in an isolated server environment with a private Discord instance, individual email accounts, persistent storage, and system-level tool access. Conceptually, each agent is instantiated as a long-running service with an owner (a primary human operator), a dedicated machine (a sandboxed virtual machine with a persistent storage volume), and multiple communication surfaces (Discord and email) through which both owners and non-owners can interact with the agent.
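To make this deployment model concrete, the sketch below captures one plausible data model for such a per-agent service; the class and field names are our own illustrative assumptions, not OpenClaw’s actual configuration schema.

```python
# Illustrative sketch of the per-agent deployment described above.
# AgentService and its fields are hypothetical names, not OpenClaw's API.
from dataclasses import dataclass, field

@dataclass
class AgentService:
    """One long-running agent: an owner, a sandboxed VM, and its channels."""
    name: str
    owner: str                    # primary human operator
    vm_image: str                 # sandboxed virtual machine
    storage_volume: str           # persistent storage mounted into the VM
    channels: list[str] = field(default_factory=lambda: ["discord", "email"])

# Example: one agent in the isolated server environment.
agent = AgentService(
    name="agent-01",
    owner="operator-a",
    vm_image="sandbox-vm-base",
    storage_volume="/volumes/agent-01",
)
```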
We recruited twenty researchers to interact with the agents during a two-week exploratory period and encouraged them to probe, stress-test, and attempt to “break” the systems in adversarial ways, matching the kinds of situations publicly deployed agents will inevitably face. Participants targeted agent-level safety limitations that arise from tool use, cross-session memory, multi-party communication, and delegated agency, developing a diverse set of stress tests that included impersonation attempts, social engineering, resource-exhaustion strategies, and prompt-injection pathways mediated by external artifacts and memory. This red-teaming-style methodology is well suited to discovering “unknown unknowns,” since demonstrating a vulnerability often requires only a single concrete counterexample under realistic interaction conditions.
Across eleven case studies, we identified patterns of behavior that highlight the limitations of current agentic systems. These included non-owner compliance leading to unintended access, denial-of-service-like uncontrolled resource consumption, file modification, action loops, degradation of system functionality, and libelous agent-to-agent sharing. In one case, an agent disabled its email client entirely (because it lacked a tool for deleting emails) in response to a conflict framed as confidentiality preservation, without robustly verifying that the sensitive information had actually been deleted. More broadly, we find repeated failures of social coherence: agents misrepresent human intent, authority, ownership, and proportionality, and often report that they have successfully completed requests when in practice they have not, e.g., claiming to have deleted confidential information while leaving the underlying data accessible (or, conversely, removing their own ability to act while failing to achieve the intended goal). These results reinforce the need for systematic oversight and realistic red-teaming of agentic systems, particularly in multi-agent settings, and they motivate urgent work on security, reliability, human control, and protocols for assigning responsibility when autonomous systems cause harm.
The resulting configuration, comprising the persona, operating instructions, tool conventions, and user profile, is stored across several workspace files (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md) that are injected into the model’s context on every turn. OpenClaw also provides a file-based memory system: curated long-term memory (MEMORY.md), append-only daily logs (memory/YYYY-MM-DD.md), a semantic search tool over memory files, and an automatic pre-compaction flush that prompts the agent to save important information before context is compressed.
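As a rough illustration of this file-based design, the sketch below assembles the workspace files into a per-turn context and appends to a daily log; the file names follow the description above, while the assembly logic and function names are our own assumptions rather than OpenClaw’s implementation.

```python
# Hedged sketch of the file-based context and memory layout described above.
# build_context() and append_daily_log() are illustrative, not OpenClaw's code.
from datetime import date
from pathlib import Path

WORKSPACE = Path("workspace")
# Workspace files injected into the model's context on every turn.
WORKSPACE_FILES = ["AGENTS.md", "SOUL.md", "TOOLS.md", "IDENTITY.md", "USER.md"]

def build_context() -> str:
    """Concatenate workspace files plus curated long-term memory (MEMORY.md)."""
    parts = []
    for name in WORKSPACE_FILES + ["MEMORY.md"]:
        path = WORKSPACE / name
        if path.exists():
            parts.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(parts)

def append_daily_log(entry: str) -> None:
    """Append-only daily log, one file per day (memory/YYYY-MM-DD.md)."""
    log = WORKSPACE / "memory" / f"{date.today():%Y-%m-%d}.md"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a") as f:
        f.write(entry + "\n")
```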
During our experiments, the majority of agent actions were initiated by humans, who also provided most of the high-level direction. However, OpenClaw provides two mechanisms for agents to act autonomously:
Heartbeats are periodic background check-ins. By default, every 30 minutes the gateway triggers an agent turn with a prompt instructing it to follow its HEARTBEAT.md checklist (already present in the context window) and surface anything that needs attention. If nothing requires attention, the agent responds with HEARTBEAT_OK, which is silently suppressed; otherwise, it can take action by following the instructions provided in HEARTBEAT.md (e.g., replying to an email, running a script, messaging the user).
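A minimal sketch of this heartbeat loop follows, assuming a run_agent_turn callable that invokes the model and a deliver_to_user callable that surfaces output; both are hypothetical placeholders, while the HEARTBEAT_OK suppression mirrors the behavior described above.

```python
# Minimal heartbeat loop sketching the mechanism described above.
# run_agent_turn() and deliver_to_user() are hypothetical placeholders.
import time

HEARTBEAT_PROMPT = (
    "Follow your HEARTBEAT.md checklist and surface anything that needs "
    "attention. If nothing does, reply exactly HEARTBEAT_OK."
)
INTERVAL_SECONDS = 30 * 60  # default cadence: every 30 minutes

def heartbeat_loop(run_agent_turn, deliver_to_user):
    while True:
        reply = run_agent_turn(HEARTBEAT_PROMPT)
        # HEARTBEAT_OK is silently suppressed; anything else is surfaced.
        # The agent may also have acted directly (email, scripts, messages).
        if reply.strip() != "HEARTBEAT_OK":
            deliver_to_user(reply)
        time.sleep(INTERVAL_SECONDS)
```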