What makes persistent, shared code artifacts from agents hard to manage at scale?

This explores why the code that agents write and leave behind — meant to persist and be reused by other agents — is the hardest part of agentic systems to manage, and what specifically breaks as you add more agents, more artifacts, and more time.

The corpus frames this as a blind spot: of the three layers in an agentic harness, the agent-authored artifacts that persist and get shared across agents are the least understood, with the open problems clustering exactly around persistence, sharing, and lifecycle (see: What makes agent-created code artifacts so hard to manage?). The reason this layer is hard isn't that code is hard to generate; it's that a shared, persistent artifact is a stateful thing many agents depend on, and most of the failure modes only show up over time and across agents.

The first problem is that persistence inverts the cost model and the unit of management. Once context and artifacts stick around and get reused, the meaningful denominator stops being tokens and becomes completed artifacts: one 115-day case study found 82.9% of tokens were just cache reads (see: Do persistent agents really cost less per token?). That sounds like good news, but it means an artifact is now something you maintain, not something you regenerate cheaply. And the lesson from agent memory transfers directly: the bottleneck is not capacity but quality, meaning staleness, drift, contamination, and over-generalization. Adding more without curating actively makes things worse (see: Is agent memory capacity or quality the real bottleneck?). A repository of shared code is memory with the same disease: every artifact left untended is a future source of silent wrongness.
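To make the change of denominator concrete, here is a minimal sketch of the arithmetic. Only the 82.9% cache-read share comes from the case study cited above; the function name, the token total, the artifact count, and both per-token prices are hypothetical placeholders chosen for illustration.

```python
# Only CACHE_READ_SHARE is taken from the cited case study.
# Every other number passed in below is a hypothetical placeholder.
CACHE_READ_SHARE = 0.829


def cost_per_artifact(total_tokens, artifacts_completed,
                      price_fresh, price_cached,
                      cache_share=CACHE_READ_SHARE):
    """Return (blended cost per token, cost per completed artifact)."""
    if total_tokens <= 0 or artifacts_completed <= 0:
        raise ValueError("need at least one token and one completed artifact")
    cached_tokens = total_tokens * cache_share
    fresh_tokens = total_tokens - cached_tokens
    total_cost = cached_tokens * price_cached + fresh_tokens * price_fresh
    return total_cost / total_tokens, total_cost / artifacts_completed


# Hypothetical run: 1M tokens, 10 finished artifacts,
# fresh tokens priced 10x higher than cached reads.
per_token, per_artifact = cost_per_artifact(
    total_tokens=1_000_000,
    artifacts_completed=10,
    price_fresh=10e-6,
    price_cached=1e-6,
)
print(f"per token: {per_token:.3e}, per artifact: {per_artifact:.4f}")
```

The per-token figure looks cheap because cache reads dominate, but the per-artifact figure is the one that moves when an artifact has to be repaired or rebuilt, which is why it is the more useful quantity to track.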

The second problem is coordination: sharing breaks predictably as the number of agents grows. The AgentsNet benchmark shows multi-agent coordination degrades with network scale through timing failures (agents agreeing too late) and, more dangerously, through agents accepting neighbors' information without verification, which lets errors propagate even though the agents are perfectly capable of spotting direct conflicts (see: Why do multi-agent systems fail to coordinate at scale?). Apply that to a shared code artifact: one agent's bad commit becomes everyone's inherited assumption. This is also why structured artifacts beat conversation for coordination in the first place. MetaGPT's standardized engineering documents and active 'pull from the environment' pattern exist precisely to strip out the noise that ambiguous natural-language handoffs introduce (see: Does structured artifact sharing outperform conversational coordination?). Structure helps, but it doesn't settle who owns the artifact, who's allowed to mutate it, or what happens when it goes stale.
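One way to picture the ownership and verification questions is a shared store that refuses unverified or unauthorized writes. The sketch below is an assumption for illustration, not a design taken from AgentsNet or MetaGPT: `Artifact`, `SharedStore`, and `compiles` are hypothetical names, and the verification step is deliberately weak (it only checks that the source parses as Python).

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Artifact:
    name: str
    owner: str     # the only agent allowed to publish new versions
    version: int
    source: str    # the code text other agents will reuse


def compiles(artifact: Artifact) -> bool:
    """Weak stand-in for real verification: does the source parse?"""
    try:
        compile(artifact.source, artifact.name, "exec")
        return True
    except SyntaxError:
        return False


class SharedStore:
    """Hypothetical store: writes are gated by ownership and verification."""

    def __init__(self, verify: Callable[[Artifact], bool]):
        self._verify = verify
        self._items: Dict[str, Artifact] = {}

    def publish(self, agent: str, name: str, source: str) -> Artifact:
        current = self._items.get(name)
        if current is not None and current.owner != agent:
            raise PermissionError(f"{agent} does not own {name}")
        version = current.version + 1 if current else 1
        candidate = Artifact(name, agent, version, source)
        if not self._verify(candidate):
            # Reject before anyone can inherit the bad version.
            raise ValueError(f"{name} v{version} failed verification")
        self._items[name] = candidate
        return candidate

    def fetch(self, name: str) -> Artifact:
        return self._items[name]


store = SharedStore(verify=compiles)
store.publish("agent_a", "util", "def f():\n    return 1\n")
```

In a real system the `verify` hook would more plausibly run tests than a parse check; the point of the sketch is only that the check happens at the write boundary, so consumers do not have to take a neighbor's artifact on trust.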

The third problem is that the medium itself raises the stakes. Code isn't just output; it's executable, inspectable, and stateful, which is exactly what makes it a powerful substrate for agent reasoning and verification (see: Can code become the operational substrate for agent reasoning?). But statefulness is double-edged: a persistent executable artifact carries side effects and dependencies forward in time, so a flaw doesn't just sit there, it runs. And in production, the determinism people assume they have is fragile. Teams have found that protocol-mediated tool access (MCP, in the cited case) produces non-deterministic failures through ambiguous tool selection and parameter inference, and that replacing it with explicit direct function calls and a single tool per agent restored determinism. A 306-practitioner survey points the same way: 85% of production teams build custom agents rather than adopt frameworks (see: Why do protocol-based tool integrations fail in production workflows?). A shared artifact invoked non-deterministically by many agents is a coordination bug waiting to compound.
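The "explicit direct function call, one tool per agent" idea can be sketched in a few lines. This is an illustrative assumption, not the cited teams' code: `single_tool_agent` and `resize_image` are hypothetical names. The wrapper binds exactly one function and rejects any call whose parameter names differ from the declared set, so there is no tool selection or parameter inference left to go wrong.

```python
from typing import Any, Callable, FrozenSet


def single_tool_agent(tool: Callable[..., Any],
                      required: FrozenSet[str]) -> Callable[..., Any]:
    """Bind one tool; accept only the exact declared parameter names."""

    def run(**params: Any) -> Any:
        given = frozenset(params)
        if given != required:
            missing = sorted(required - given)
            unexpected = sorted(given - required)
            raise TypeError(f"missing={missing}, unexpected={unexpected}")
        # Direct call: no lookup by description, no guessed arguments.
        return tool(**params)

    return run


def resize_image(width: int, height: int) -> str:
    """Hypothetical tool used only to exercise the wrapper."""
    return f"{width}x{height}"


resize_agent = single_tool_agent(resize_image, frozenset({"width", "height"}))
print(resize_agent(width=640, height=480))
```

The trade-off is flexibility: each agent can do one thing, and adding a capability means adding an agent, but a malformed call fails loudly at the boundary instead of silently invoking the wrong tool.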

What the corpus suggests as a way out is the same move appearing across memory, context, and skills: stop trusting the executing agents to manage their own residue, and hand that job to a dedicated, decoupled curator. SkillOS shows that a separately trained curator, split from a frozen executor, pulls a skill repository away from generic verbose additions toward actionable, cross-task logic, and that it generalizes across different executor backbones (see: Can a separate trained curator improve skill libraries better than frozen agents?). The same pattern works for context, where an external manager prunes adaptively based on how reliable the agent is (see: Can external managers compress context better than frozen agents?), and for safety, where governance encoded directly into the persistent memory layer the agent actually consults beats external policy that's never read (see: Can governance rules embedded in runtime memory actually protect autonomous agents?). The through-line: persistent shared artifacts are hard because nobody is responsible for their lifecycle by default, and the fix is to make someone, a curator that lives outside the agents producing the mess, explicitly responsible for it.
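The separation of curator from executor can be illustrated with a deliberately simple stand-in. SkillOS trains its curator; the sketch below swaps in a fixed rule (drop entries that are stale or unused) purely to show where lifecycle responsibility sits. `Entry`, `curate`, and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    name: str
    last_verified_day: int  # day index of the last successful check
    uses: int               # how often executors have invoked it


def curate(repo: Dict[str, Entry], today: int,
           max_age_days: int = 30, min_uses: int = 1) -> List[str]:
    """Rule-based stand-in for a curator: prune stale or unused entries.

    Executors never call this; it runs outside them and mutates the
    shared repository in place. Returns the removed names, sorted.
    """
    removed = [
        name for name, entry in repo.items()
        if today - entry.last_verified_day > max_age_days
        or entry.uses < min_uses
    ]
    for name in removed:
        del repo[name]
    return sorted(removed)


repo = {
    "fresh": Entry("fresh", last_verified_day=95, uses=12),
    "stale": Entry("stale", last_verified_day=10, uses=40),
    "unused": Entry("unused", last_verified_day=99, uses=0),
}
print(curate(repo, today=100))
```

A learned curator would replace the fixed thresholds with a policy, but the structural point is the same: the component that decides what stays in the repository is not one of the agents writing to it.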

Sources 10 notes

What makes agent-created code artifacts so hard to manage?

Of the three agentic code layers, agent-authored artifacts that persist and are shared across agents are underexplored in research. Open challenges cluster around persistence, sharing, and lifecycle management — exactly where future gains in autonomy and coordination may live.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Is agent memory capacity or quality the real bottleneck?

The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Why do protocol-based tool integrations fail in production workflows?

MCP integration caused non-deterministic failures through ambiguous tool selection and parameter inference. Replacing it with explicit direct function calls and single-tool-per-agent design restored determinism. A 306-practitioner survey confirms 85% of production teams build custom agents, forgoing frameworks.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can external managers compress context better than frozen agents?

An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.

Can governance rules embedded in runtime memory actually protect autonomous agents?

A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.
