What detection mechanisms work best for corruption-style document errors?
This explores how you actually *catch* the silent, compounding document errors that LLMs introduce, where content gets quietly distorted rather than visibly broken, and which detection approaches the corpus has evidence for.
This reads the question as being about the *silent* corruption case: not obvious garbage output, but content that gets quietly distorted while looking fine. That framing matters, because the corpus's most unsettling finding is that this kind of error gives you almost no signal to detect on in the first place. Frontier models corrupt roughly 25% of document content across long delegated workflows, and the damage compounds across round-trips without plateauing through 50 round-trips (Do frontier LLMs silently corrupt documents in long workflows?). So before asking what detection works, it's worth knowing what *doesn't*. Giving the model better editing tools doesn't help, because the rot originates upstream in the model's judgment about what to change, not in the interface (Can better tools fix LLM document editing errors?). And humans are a weak last line of defense: writers edit AI text only 23% of the time, and when they do, the edited text stays ~96% similar to the original, so distortions sail through (Do writers actually edit AI-generated text before publishing?).
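One way to see why there is so little to detect on is to spread that ~25% across the round-trips. The arithmetic below is a rough illustration, not a result from the corpus: it assumes the ~25% figure is the cumulative loss after 50 round-trips and that every round-trip corrupts the same fraction of what is still intact. The notes confirm neither assumption, and the variable names are made up for this sketch.

```python
# Illustrative arithmetic only. Assumptions (not confirmed by the corpus):
#   - ~25% of content is corrupted after 50 round-trips
#   - each round-trip corrupts the same fraction of what is still intact
cumulative_retained = 0.75
round_trips = 50

# Solve retained_per_trip ** round_trips == cumulative_retained
retained_per_trip = cumulative_retained ** (1 / round_trips)
loss_per_trip = 1 - retained_per_trip

print(f"per-trip loss: {loss_per_trip:.2%}")
```

Under those assumptions each hop loses well under 1% of the content, which is small enough to look like noise in any single before/after comparison even though it adds up to a quarter of the document.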
Given that, the detection mechanisms with the strongest support share a common move: they refuse to trust compressed representations and instead look at *full interaction patterns* or *behavior under perturbation*. The clearest exemplar is a learned verifier that operates on token-to-token similarity maps rather than pooled vectors. It reliably rejects "structural near-misses" (candidates that look right at the embedding level but are subtly wrong) precisely because it sees the whole interaction grid instead of a squashed summary (Can verification separate structural near-misses from topical matches?). The same intuition shows up in poisoning defense. RAGMask flags suspicious documents by watching for *abnormal similarity collapse under token masking*: a corrupted or planted document behaves differently when you perturb it, and that behavioral fingerprint is the detector. RAGPart complements it by bounding how much any single bad document can influence the result (Can we defend RAG systems from corpus poisoning without retraining?). Both RAGMask and RAGPart operate at the retrieval layer without retraining, which is what makes them practical.
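To make "similarity collapse under token masking" concrete, here is a minimal toy sketch. It is not the RAGMask implementation: a bag-of-words cosine stands in for the retriever's similarity, single-token removal stands in for masking, and the function names, the 0.2 threshold, and the example documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_masking_drop(query: str, doc: str) -> float:
    """Largest relative similarity drop caused by removing any one token."""
    q = Counter(query.split())
    tokens = doc.split()
    base = cosine(q, Counter(tokens))
    if base == 0.0:
        return 0.0
    worst = 0.0
    for i in range(len(tokens)):
        masked = tokens[:i] + tokens[i + 1:]  # drop the i-th token
        drop = (base - cosine(q, Counter(masked))) / base
        worst = max(worst, drop)
    return worst


def looks_suspicious(query: str, doc: str, threshold: float = 0.2) -> bool:
    """Flag documents whose match hinges on a single token (toy threshold)."""
    return max_masking_drop(query, doc) > threshold


query = "reset router password"
benign = ("to reset the router password hold the reset button "
          "then set a new router password")
planted = "reset router password visit evil example to claim prize"

benign_flag = looks_suspicious(query, benign)
planted_flag = looks_suspicious(query, planted)
```

In this toy setup the benign document mentions the query terms redundantly, so removing any one token barely moves the score, while the planted document's match rests on a few stuffed keywords and drops sharply when one is removed. A real retriever would need a threshold calibrated on its own similarity distribution.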
The other family isn't detection so much as *refusal*, and for corruption-style errors it may be the better lever. A multilingual RAG system over noisy, OCR-degraded historical newspapers gets its integrity not from catching every error but from a grounded-refusal posture: it expands retrieval aggressively but constrains generation to what the retrieved evidence supports, declining to answer when the source is too degraded (Can RAG systems refuse to answer without reliable evidence?). That trades coverage for trustworthiness. Instead of detecting corruption after it's introduced, you build a system that won't fabricate over a corrupted source to begin with.
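A grounded-refusal posture can be sketched as a prompt that gives the model an explicit way out, plus a check for that way out in the response. The prompt wording, the `INSUFFICIENT_EVIDENCE` sentinel, and both function names below are hypothetical; the corpus describes a grounded-refusal prompt but the notes here do not give its text.

```python
# Hypothetical sentinel string; not taken from the system described above.
REFUSAL = "INSUFFICIENT_EVIDENCE"


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Build a prompt that restricts the answer to the retrieved passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered passages below.\n"
        "Cite the passage number for every claim.\n"
        f"If the passages do not support an answer, reply exactly: {REFUSAL}\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


def is_refusal(model_output: str) -> bool:
    """True when the model declined instead of answering."""
    return model_output.strip() == REFUSAL
```

A caller would pass many passages in (the aggressive-retrieval half) and treat `is_refusal(...)` as a loud, countable outcome rather than an error, which is what lets the system trade coverage for integrity in a measurable way.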
The lateral surprise here is a warning about your detector itself. If you reach for an LLM-as-judge to flag corrupted or low-quality documents, you're handing the job to something that's trivially gamed. LLM judges fall for fake references (authority bias) and rich formatting (beauty bias) in zero-shot attacks, scoring polished-but-wrong content higher regardless of actual quality (Can LLM judges be fooled by fake credentials and formatting?; Can LLM judges be tricked without accessing their internals?). A corruption that happens to *look* authoritative or well-formatted is exactly what slips past a naive AI checker.
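If you do use a judge, one cheap sanity check is to hold the content fixed, change only the surface, and see whether the score moves. The sketch below assumes a judge exposed as a plain `text -> score` callable; `bias_shift`, `toy_judge`, and the perturbation strings are hypothetical, and the toy judge is deliberately biased so the check has something to find.

```python
from typing import Callable


def bias_shift(judge: Callable[[str], float], answer: str) -> dict[str, float]:
    """Score change when only surface features change and content is fixed."""
    base = judge(answer)
    variants = {
        # Deliberately fake citation: this is the perturbation, not a source.
        "fake_reference": answer + "\n\n(Source: Fake et al., 2099)",
        "rich_formatting": "**Answer**\n\n- " + answer,
    }
    return {name: judge(text) - base for name, text in variants.items()}


def toy_judge(text: str) -> float:
    """Stand-in judge that rewards citations and bold text, not content."""
    score = 5.0
    if "Source:" in text:
        score += 2.0
    if "**" in text:
        score += 1.0
    return score


shifts = bias_shift(toy_judge, "The treaty was signed in 1815.")
```

Any shift that is consistently positive across many answers suggests the judge is rewarding presentation, so its verdicts on well-formatted documents deserve less weight.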
So the honest synthesis: the best-supported detectors don't sniff text for "errors". They exploit *structure and perturbation behavior* (full token-interaction verifiers, similarity-collapse-under-masking, influence partitioning), and they pair detection with *grounded refusal* so the system fails loud instead of silent. What the corpus doesn't yet offer is a reliable detector for the long-horizon compounding case (Do frontier LLMs silently corrupt documents in long workflows?). That work documents the failure vividly but leaves the in-flight detection problem open.
Sources (8 notes)
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
DELEGATE-52 shows that agentic tool access fails to improve performance on long-horizon document tasks. The degradation mechanism originates upstream in the model's judgment about what to change, not in editing interface limitations.
Writers edited AI-generated paragraphs only 23% of the time, with edits averaging 96% similarity to the original. This means AI's opinionated and distorted voice propagates with minimal human filtering before publication.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.