
Do frontier LLMs silently corrupt documents in long workflows?

Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.

Note · 2026-05-18 · sourced from Flaws

Delegation requires trust: the expectation that an LLM will execute a task without introducing errors. DELEGATE-52 stress-tests that expectation with 310 work environments across 52 domains (including coding, crystallography, music notation, and genealogy) and a round-trip relay protocol in which each task is paired with its inverse, so a perfect model would recover the original document exactly.
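The round-trip idea can be illustrated with a minimal sketch. This is not the DELEGATE-52 harness: the function names (`run_relay`, `similarity`, `reverse_text`, `lossy_reverse`) are hypothetical, the "tasks" are toy string transforms standing in for LLM calls, and character-level similarity from Python's `difflib` is an assumed stand-in for whatever scoring the benchmark actually uses.

```python
from difflib import SequenceMatcher
from typing import Callable, List

Transform = Callable[[str], str]


def similarity(original: str, current: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the texts are identical."""
    return SequenceMatcher(None, original, current, autojunk=False).ratio()


def run_relay(document: str, forward: Transform, inverse: Transform,
              round_trips: int) -> List[float]:
    """Apply forward then inverse repeatedly, scoring against the original each time."""
    scores = []
    current = document
    for _ in range(round_trips):
        current = inverse(forward(current))
        scores.append(similarity(document, current))
    return scores


# Toy stand-ins for a task and its inverse (hypothetical, not real model calls).
def reverse_text(text: str) -> str:
    return text[::-1]


def lossy_reverse(text: str) -> str:
    # Undoes the reversal but silently drops one character, mimicking a small error.
    return text[::-1][:-1]


if __name__ == "__main__":
    doc = "round-trip relay over a small document"
    print(run_relay(doc, reverse_text, reverse_text, 3))   # exact inverse: no drift
    print(run_relay(doc, reverse_text, lossy_reverse, 3))  # lossy inverse: drift accumulates
```

With an exact inverse, every score stays at 1.0. With the lossy inverse, each round trip looks almost correct on its own, but the score against the original keeps falling, which is the effect the relay protocol is designed to expose.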

Across 19 LLMs, even frontier systems (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows. Weaker models fail more severely. The degradation curve decelerates but does not plateau: the first half of an extended relay accounts for 2-3x more loss than the second half, yet the strongest model still drops below 60% accuracy by round-trip 50. Distractor files, longer documents, and longer interactions all worsen the degradation.
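To make "decelerates but does not plateau" concrete, the sketch below splits a series of per-round-trip scores into halves and compares the loss in each. The helper name `half_losses` is hypothetical, and the score series is invented purely for illustration; it is not data from the benchmark.

```python
from typing import List, Tuple


def half_losses(scores: List[float], start: float = 1.0) -> Tuple[float, float]:
    """Return (loss in first half, loss in second half) of a relay.

    scores[i] is the accuracy after round trip i + 1; `start` is the
    accuracy before any round trip (an assumed perfect 1.0).
    """
    if len(scores) < 2:
        raise ValueError("need at least two round trips to compare halves")
    mid = len(scores) // 2
    first = start - scores[mid - 1]
    second = scores[mid - 1] - scores[-1]
    return first, second


if __name__ == "__main__":
    # Invented, illustrative curve (NOT benchmark results).
    illustrative = [0.90, 0.84, 0.80, 0.77, 0.75, 0.73]
    first, second = half_losses(illustrative)
    print(f"first half loss:  {first:.2f}")
    print(f"second half loss: {second:.2f}")
    print(f"ratio: {first / second:.1f}x")
```

A curve that had plateaued would show a second-half loss near zero. In a decelerating curve like the invented one here, the second half loses less than the first but still loses something, so accuracy keeps falling as the workflow gets longer.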

The structural problem: errors are sparse but severe and they compound silently. A user reviewing one or two outputs sees competent work. A user delegating an end-to-end workflow gets a document that looks intact but contains accumulated drift in places they did not check. The trust assumption that holds at single-step interaction collapses at the timescale where delegation is actually valuable.

This is not a "weak model" finding. It is a ceiling on delegated work at the current frontier — one that scales unfavorably with exactly the workflow length that makes delegation attractive.


Original note title

frontier LLMs silently corrupt 25 percent of document content over long delegated workflows without plateauing
