Do frontier LLMs silently corrupt documents in long workflows?
Explores whether advanced language models introduce hard-to-detect errors when they are delegated multi-step tasks, and whether the degradation keeps accumulating beyond the initial rounds of processing.
Delegation requires trust: the expectation that an LLM will execute a task without introducing errors. DELEGATE-52 stress-tests that expectation with 310 work environments across 52 domains (such as coding, crystallography, music notation, and genealogy) and a round-trip relay protocol in which each task is paired with its inverse, so a perfect model would recover the original document exactly.
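The round-trip idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the benchmark's actual harness: the `relay` function, the stand-in "model" functions, and the use of `difflib.SequenceMatcher` as a similarity score are all hypothetical choices made here, and the real benchmark may score documents differently.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# An "edit" stands in for one delegated task: it takes a document and returns a document.
Edit = Callable[[str], str]


def relay(document: str, task_pairs: List[Tuple[Edit, Edit]]) -> List[float]:
    """Run each (forward, inverse) pair in sequence and return the similarity
    to the ORIGINAL document after every completed round trip."""
    scores = []
    current = document
    for forward, inverse in task_pairs:
        current = inverse(forward(current))  # one round trip
        scores.append(SequenceMatcher(None, document, current).ratio())
    return scores


# Hypothetical stand-ins for model calls.
def to_upper(text: str) -> str:
    return text.upper()


def exact_inverse(text: str) -> str:
    # A perfect executor: undoes to_upper exactly for an all-lowercase source.
    return text.lower()


def lossy_inverse(text: str) -> str:
    # An imperfect executor: undoes the task but silently changes one digit.
    # The document keeps its length and shape, so the damage is easy to miss.
    return text.lower().replace("0", "8", 1)


original = "order 1000 units at 2000 each\n"

exact_scores = relay(original, [(to_upper, exact_inverse)] * 3)
lossy_scores = relay(original, [(to_upper, lossy_inverse)] * 3)

print("exact:", exact_scores)
print("lossy:", lossy_scores)
```

With the exact inverse every score stays at 1.0. With the lossy inverse the score falls a little more after each round trip, even though every intermediate document still looks like a plausible order line.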
Across 19 LLMs, even frontier systems (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows. Weaker models fail more severely. The degradation curve decelerates but does not plateau: the first half of an extended relay accounts for 2-3x more loss than the second half, yet the strongest model still drops below 60% accuracy by round-trip 50. Distractor files, longer documents, and longer interactions all make the degradation worse.
The structural problem: errors are sparse but severe and they compound silently. A user reviewing one or two outputs sees competent work. A user delegating an end-to-end workflow gets a document that looks intact but contains accumulated drift in places they did not check. The trust assumption that holds at single-step interaction collapses at the timescale where delegation is actually valuable.
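A back-of-the-envelope calculation shows why spot checks miss sparse errors. The sketch below assumes a document split into equally sized segments and a reviewer who checks a uniformly random subset. Both are simplifying assumptions introduced here, and the segment counts are hypothetical numbers, not figures from the benchmark.

```python
from math import comb


def miss_probability(total_segments: int, corrupted: int, checked: int) -> float:
    """Probability that a uniformly random spot check of `checked` segments
    contains none of the `corrupted` segments (hypergeometric, zero hits)."""
    if not 0 <= corrupted <= total_segments:
        raise ValueError("corrupted must be between 0 and total_segments")
    if not 0 <= checked <= total_segments:
        raise ValueError("checked must be between 0 and total_segments")
    # Ways to pick only clean segments, divided by all ways to pick.
    return comb(total_segments - corrupted, checked) / comb(total_segments, checked)


# Hypothetical scenario: 100 segments, 5 silently corrupted.
for checked in (1, 2, 10, 50):
    p = miss_probability(100, 5, checked)
    print(f"check {checked:>2} segments -> {p:.1%} chance of seeing no error")
```

Under these assumptions, a reviewer who looks at only one or two segments will usually see nothing wrong, which matches the gap between reviewing a few outputs and trusting a whole workflow.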
This is not a "weak model" finding. It is a ceiling on delegated work at the current frontier — one that scales unfavorably with exactly the workflow length that makes delegation attractive.
Related concepts in this collection
- Do frontier models fail differently than weaker models?
  Weaker LLMs delete document content visibly, while frontier models corrupt it invisibly. This shift in failure mode raises questions about whether capability improvements actually improve real-world reliability when reviewers can't easily spot the errors.
  same paper, mechanism for why frontier failure is harder to detect
- Can better tools fix LLM document editing errors?
  Does giving LLMs agentic tool access (such as diffing, re-reading, or structured editors) improve their reliability on long-horizon document workflows? Understanding whether the problem is tool limitations or decision-making quality matters for reliability engineering.
  same paper, fixes that do not work
- Do short benchmarks predict how models perform over long workflows?
  Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.
  same paper, methodology implication
- Do models fail worse when their own errors fill the context?
  As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
  adjacent mechanism for compounding error
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  adjacent: capable rationale but unreliable execution
Original note title
frontier LLMs silently corrupt 25 percent of document content over long delegated workflows without plateauing