LLMs Corrupt Your Documents When You Delegate

Paper · arXiv 2604.15597
LLM Failure Modes · Tool Use and Computer-Use Agents · LLM Agents · LLM Evaluations and Benchmarks

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust—the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

The viability of delegated work hinges on LLMs' ability to carry out tasks and manipulate domain documents without introducing errors. The first contribution of our work is DELEGATE-52, a benchmark with 310 work environments across 52 professional domains, including coding, genealogy, and music sheet notation. Each environment consists of real documents totaling around 15k tokens in length, and 5-10 complex editing tasks that a user might ask an LLM to carry out. Our second contribution is the round-trip relay simulation method, which enables us to simulate long-horizon delegated interaction and evaluate LLM performance without requiring annotation or reference solutions. Specifically, we assume every editing task is reversible, defined by a forward instruction and its inverse. Applying both in sequence forms a backtranslation round-trip that, under a perfect model, recovers the original documents exactly.
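The round-trip relay idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the benchmark's implementation: `ReversibleTask`, `run_relay`, and the two toy editors are hypothetical, the tasks are simply cycled in order, and the character-level `difflib` ratio stands in for whatever domain-aware scoring the benchmark actually uses. In a real setup, `editor` would wrap a call to the model under evaluation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ReversibleTask:
    """A forward editing instruction paired with the instruction that undoes it."""
    forward: str
    inverse: str


# An editor takes (document, instruction) and returns the edited document.
# Hypothetical interface: in practice this would call the LLM being tested.
Editor = Callable[[str, str], str]


def similarity(original: str, current: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in metric (assumption)."""
    return SequenceMatcher(None, original, current, autojunk=False).ratio()


def run_relay(original: str, tasks: Sequence[ReversibleTask],
              editor: Editor, round_trips: int) -> List[float]:
    """Apply forward then inverse edits repeatedly, scoring after each round-trip."""
    if not tasks:
        raise ValueError("at least one reversible task is required")
    document = original
    scores = []
    for step in range(round_trips):
        task = tasks[step % len(tasks)]  # cycling through tasks is an assumption
        document = editor(document, task.forward)
        document = editor(document, task.inverse)
        # No reference solution is needed: the original document is the reference.
        scores.append(similarity(original, document))
    return scores


def perfect_editor(document: str, instruction: str) -> str:
    """Toy editor: every instruction means 'reverse the line order' (self-inverse)."""
    return "\n".join(reversed(document.split("\n")))


def lossy_editor(document: str, instruction: str) -> str:
    """Toy editor that reverses the lines but silently drops one line per edit."""
    lines = list(reversed(document.split("\n")))
    return "\n".join(lines[:-1] if len(lines) > 1 else lines)
```

With the toy editors, `perfect_editor` recovers the original after every round-trip, while `lossy_editor` yields a score that falls with each round-trip, which is the kind of compounding loss the relay is designed to expose.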

We find that degradations continue to accumulate in longer relays, with none of the models showing signs of plateauing. The rate of degradation decelerates: the first half of the extended relay (round-trips 5–25) accounts for roughly 2–3x more loss than the second half (25–50), but even the strongest model (GPT 5.4) drops below 60% by the end of a 50-round-trip relay. As we extend relays from 10 to 50 round-trips, performance continues to degrade, with models introducing novel errors even when tasks repeat. We find that models perform better in programmatic domains (Python, DBSchema) compared to natural language and niche domains (e.g., earning statements, music notation). Weaker models' degradation originates primarily from content deletion, while frontier models' degradation is attributable to corruption of content.
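The comparison between the two halves of an extended relay reduces to simple arithmetic on a score curve. The sketch below assumes a hypothetical mapping from round-trip index to score; the function names are illustrative and no values from the paper are reproduced.

```python
from typing import Mapping


def loss_between(scores: Mapping[int, float], start: int, end: int) -> float:
    """Score lost between round-trip `start` and round-trip `end`."""
    return scores[start] - scores[end]


def half_loss_ratio(scores: Mapping[int, float],
                    start: int, mid: int, end: int) -> float:
    """Ratio of the loss in [start, mid] to the loss in [mid, end]."""
    first_half = loss_between(scores, start, mid)
    second_half = loss_between(scores, mid, end)
    if second_half <= 0:
        raise ValueError("no loss in the second half; the ratio is undefined")
    return first_half / second_half
```

A ratio above 1 from `half_loss_ratio(scores, 5, 25, 50)` means degradation is decelerating; it does not mean the curve has plateaued, since the second-half loss is still positive.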

Our simulation experiments point to several underexplored research directions that warrant more attention. First, model performance in short interactions is not always predictive of long-horizon performance, and studying model capabilities for long interactions (beyond memory management) is essential to understanding readiness for realistic delegated workflows. Second, the community at times frames "agent benchmarks" and "LLM benchmarks" as separate fields, but they should be seen as two modes of operation for accomplishing tasks: when benchmarking an LLM, we need to consider its various modes of operation to better understand its capabilities and limitations. When delegating work to AI systems, users of LLMs should be cautious not to generalize the capabilities of the LLM in one domain to other domains. Model capabilities follow a jagged frontier, with models exhibiting strong (and sometimes surprising) performance at certain tasks, while making severe errors in others.

Error Propagation in Multi-Agent Systems. In recent years, multi-agent systems (MAS) composed of multiple interacting LLMs have attracted growing attention (Guo et al., 2024). As Hammond et al. (2025) show, the safety of MAS is critical. This is especially true given the many applications of these systems in finance (Xiao et al., 2025), programming (Hong et al., 2024), and more critical domains such as the energy sector or defence (Hammond et al., 2025). A major safety risk in multi-agent systems is error propagation, where factually wrong or misaligned behaviour of a single agent is adopted by the other agents (Wynn et al., 2025). In this paper, we focus on the case where the errors are due to an adversarial attack on one or more agents of the network, excluding errors introduced by, e.g., hallucination. How and when propagation happens depends on both the concrete attack and the chosen topology of the system (Huang et al., 2025), where densely connected topologies tend to propagate errors less (Shen et al., 2025).

Adversarial Attacks on LLMs and Multi-Agent Systems. While the choice of topology plays an important role in error propagation (Shen et al., 2025), the specific attack does too (Huang et al., 2025). Prior work on prompt sensitivity (Zhuo et al., 2024; Ismithdeen et al., 2025; Sclar et al., 2023) shows that prompt design can drastically change the behaviour of LLMs, opening the door to prompt-based attacks. In both the pure user-LLM case and the multi-agent scenario, a large number of possible attacks exist (deWitt, 2025). Both black-box and white-box jailbreak attacks have been studied (Yi et al., 2024) and also applied to the multi-agent case (Men et al., 2025; Rahman et al., 2025; Shahroz et al., 2025). In particular, prompt injections are a relevant way to jailbreak LLMs (Liu et al., 2025a; Rossi et al., 2024) due to their ease of use, as they are completely black-box. Defence mechanisms against prompt injections include the detection of malicious content in the prompts (Chennabasappa et al., 2025; Jacob et al., 2025; Hung et al., 2025). Recent work in this vein has also explored prompt injections that are entirely incomprehensible to humans (Cherepanova & Zou, 2024), which would fit the adversarial prompting case for user-LLM interactions from Figure 1. A slightly weaker form of adversarial prompting is stealthy prompt injection, developed for the user-LLM case, which is suffix-based (Liu et al., 2024; Mu et al., 2025). These attacks are similar to our setting, where bias transfer happens subliminally through unrelated tokens. We, too, conceal the true motive of our prompts; however, in the stealthy case (Liu et al., 2024; Mu et al., 2025) the prompts are still partly human-understandable, because only the suffix of the prompt is semantically unrelated. Standard defence techniques against such adversarial prompting include rephrasing of the question (Liu et al., 2025b).
For MAS specifically, distributed attacks are a threat (Shahroz et al., 2025), exploiting weaknesses of distributed systems through, e.g., man-in-the-middle attacks (He et al., 2025).