How does model tier affect whether errors delete or corrupt document content?
This explores how a model's capability tier changes the *kind* of failure it produces when editing documents over long workflows — whether it deletes text outright or silently rewrites it — and why the more advanced failure is the harder one to catch.
The corpus has a surprisingly clean answer: the failure mode flips as models get stronger. The DELEGATE-52 work found a genuine tier signature (see "Do frontier models fail differently than weaker models?"). Weaker models fail loudly: they drop chunks of content, and the deletion is visible because the document gets shorter. Frontier models fail quietly: instead of removing text, they corrupt it, rewriting meaning while keeping the surface looking polished and intact. So a more capable model doesn't fail less; it fails *less detectably*.
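The loud-versus-quiet distinction can be illustrated with a small sketch. It is not taken from DELEGATE-52: the two helper functions, the sample sentences, and the choice of a word-level diff are all assumptions made for illustration. It compares a length check (which notices deletion) with a comparison against the original text (which also notices rewriting).

```python
from difflib import SequenceMatcher


def length_ratio(original: str, edited: str) -> float:
    """Edited word count divided by original word count (hypothetical helper)."""
    return len(edited.split()) / max(len(original.split()), 1)


def preserved_fraction(original: str, edited: str) -> float:
    """Fraction of original words that survive, in order, in the edited text."""
    a, b = original.split(), edited.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / max(len(a), 1)


# Invented sample texts, not data from the source.
original = "The dose is 5 mg twice daily and must not exceed 10 mg per day"
deleted = "The dose is 5 mg twice daily"
corrupted = "The dose is 50 mg once daily and may not exceed 100 mg per day"

for label, edited in [("deleted", deleted), ("corrupted", corrupted)]:
    print(
        label,
        "length_ratio=%.2f" % length_ratio(original, edited),
        "preserved=%.2f" % preserved_fraction(original, edited),
    )
```

In this toy case the deleted version shows up in both measures, while the corrupted version keeps the same word count and is only flagged by the comparison against the original. A real check would need something stronger than word overlap, since a meaning-changing rewrite can touch very few words.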
That reframes what "better model" buys you. Across 19 models and 52 domains, even advanced systems corrupted roughly 25% of document content over long delegated relay tasks, and the errors compounded without ever plateauing through 50 round-trips (see "Do frontier LLMs silently corrupt documents in long workflows?"). Surface competence is exactly what makes this dangerous: deletion trips your alarms, corruption sails past them. The reader expecting frontier scale to be a safety margin should flip that intuition: scale moves the failure from one you'd notice to one you wouldn't.
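For a rough sense of scale, the figures above can be turned into a per-round number. This is a back-of-the-envelope sketch under two assumptions that the source does not state: that the roughly 25% figure is reached at round 50, and that damage accrues at a constant, independent rate per round-trip. The source describes the compounding as non-linear, so the real curve need not be uniform.

```python
# Assumptions (not from the source): ~25% corruption is reached at round 50,
# and each round-trip damages a constant fraction of what is still intact.
total_corrupted = 0.25
rounds = 50

# Solve retention ** rounds == 1 - total_corrupted for the per-round retention.
per_round_retention = (1 - total_corrupted) ** (1 / rounds)
per_round_loss = 1 - per_round_retention


def retained_after(n: int) -> float:
    """Fraction of content still intact after n round-trips under this model."""
    return per_round_retention ** n


print(f"per-round loss: {per_round_loss:.4%}")
for n in (1, 10, 25, 50):
    print(n, f"{retained_after(n):.3f}")
```

Under these assumptions the loss per round-trip is well under one percent, which suggests why a check that only compares each step with the one before it could miss the drift, while a comparison against the original document would not.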
The instinctive fix, giving the model better editing tools, doesn't help. Agentic tool access produced no improvement on long-horizon document tasks, because the damage originates upstream in the model's *judgment about what to change*, not in the editing interface (see "Can better tools fix LLM document editing errors?"). You can't tool your way out of a problem that lives in the decision, not the mechanism.
What actually drives the compounding is the model feeding on its own mistakes. Once prior errors sit in the context history, performance degrades non-linearly: the model conditions on its earlier corruptions and amplifies them, and simply scaling the model up does not break this loop (see "Do models fail worse when their own errors fill the context?"). This is the missing link: corruption is silent *and* self-reinforcing, which is why frontier failures snowball instead of leveling off. The one lever that helped was test-time compute: thinking models that reason before acting partly resist letting an error-contaminated context bias the next step.
The thing you didn't know you wanted to know: in long workflows, picking the most capable model can quietly *raise* the cost of an undetected error, because it trades visible damage for invisible damage — and the better lever is a model that pauses to reason over its own history, not one with more raw parameters.
Sources (4 notes)
DELEGATE-52 demonstrates that LLMs degrade documents through qualitatively different mechanisms by capability tier: weaker models fail through visible content deletion, while frontier models fail through silent content corruption. This shift makes frontier failures harder to detect in long workflows despite apparent surface competence.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
DELEGATE-52 shows that agentic tool access fails to improve performance on long-horizon document tasks. The degradation mechanism originates upstream in the model's judgment about what to change, not in editing interface limitations.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.