Can we verify fabricated text without redesigning the generation process?
This explores whether we can catch made-up text after the fact — through external detectors, judges, or verifiers — instead of rebuilding the model's generation process to stop fabrication at the source.
The corpus suggests the honest answer is: pure after-the-fact detection is weak, but layered external verification is surprisingly strong, and the most reliable approaches sit somewhere in between.
Start with the bad news for detection. AI text is *measurably* different from human writing across lexical-diversity dimensions, yet even trained linguists can't reliably spot it, and newer models drift further from human writing while getting harder to flag (Can humans detect AI text if machines can measure it?). Asking the model to check itself fares worse: models systematically over-trust answers they generated, because a high-probability output simply *feels* correct during self-evaluation (Why do models trust their own generated answers?). And handing the job to an LLM judge opens a different hole: judges fall for fake citations and pretty formatting in zero-shot attacks that need no model access at all (Can LLM judges be fooled by fake credentials and formatting?; Can LLM judges be tricked without accessing their internals?). So the naive version of your question, "just bolt on a detector," mostly fails.
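To make "measurably different" concrete, here is a minimal sketch of two common lexical-diversity statistics: type-token ratio and hapax ratio. The source note does not name its six dimensions, so these two measures, the function name `lexical_stats`, and the simple regex tokenizer are all illustrative assumptions, not the note's method.

```python
import re
from collections import Counter
from typing import Dict


def lexical_stats(text: str) -> Dict[str, float]:
    """Return token count, type-token ratio, and hapax ratio for a text.

    Hypothetical helper for illustration; the tokenizer is deliberately crude.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"tokens": 0, "ttr": 0.0, "hapax_ratio": 0.0}
    counts = Counter(tokens)
    # Hapax legomena: word types that occur exactly once.
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": len(tokens),
        "ttr": len(counts) / len(tokens),      # distinct words / all words
        "hapax_ratio": hapax / len(tokens),    # once-only words / all words
    }
```

Statistics like these separate populations of texts; they say little about any single passage, which is consistent with the point that measurable differences do not translate into reliable per-document detection.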
The more interesting answer is that verification works when it's *external and grounded*, checking claims against something the generator can't fake. Formal verifiers can be auto-synthesized straight from prose policy documents, producing provably correct Lean and z3 checkers that validate outputs without touching the generation model (Can we automatically generate formal verifiers from policy text?). Bidirectional RAG shows the gating pattern concretely: generated answers only get trusted (and written back into the corpus) after passing entailment checks, source-attribution checks, and novelty detection. That is verification as a downstream filter, not a redesign (Can RAG systems safely learn from their own generated answers?). The thread running through both: don't ask "does this look real," ask "can this be entailed by evidence."
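The gating pattern can be sketched as a filter that runs after generation. This is an illustration under stated assumptions, not the Bidirectional RAG implementation: the names `Candidate`, `gate`, and the three check functions are hypothetical, and the token-overlap "entailment" check is a toy stand-in for a real entailment model, which a caller would pass in through the `entails` parameter.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass(frozen=True)
class Candidate:
    answer: str
    cited_ids: Tuple[str, ...]  # passage ids the generator claims to rely on


def attribution_ok(cand: Candidate, retrieved: Dict[str, str]) -> bool:
    # Every cited id must refer to a passage that was actually retrieved.
    return bool(cand.cited_ids) and all(i in retrieved for i in cand.cited_ids)


def toy_entailment_ok(cand: Candidate, retrieved: Dict[str, str],
                      threshold: float = 0.8) -> bool:
    # ASSUMPTION: token overlap stands in for a real entailment (NLI) model.
    evidence: Set[str] = set()
    for i in cand.cited_ids:
        evidence |= _tokens(retrieved.get(i, ""))
    answer = _tokens(cand.answer)
    return bool(answer) and len(answer & evidence) / len(answer) >= threshold


def novelty_ok(cand: Candidate, corpus: List[str],
               max_jaccard: float = 0.9) -> bool:
    # Reject answers that nearly duplicate something already stored.
    answer = _tokens(cand.answer)
    for doc in corpus:
        doc_tokens = _tokens(doc)
        union = answer | doc_tokens
        if union and len(answer & doc_tokens) / len(union) > max_jaccard:
            return False
    return True


def gate(cand: Candidate, retrieved: Dict[str, str], corpus: List[str],
         entails: Callable[[Candidate, Dict[str, str]], bool] = toy_entailment_ok,
         ) -> Tuple[bool, List[str]]:
    """Return (accepted, names of failed checks). The generator is never touched."""
    failures: List[str] = []
    if not attribution_ok(cand, retrieved):
        failures.append("attribution")
    if not entails(cand, retrieved):
        failures.append("entailment")
    if not novelty_ok(cand, corpus):
        failures.append("novelty")
    return (not failures, failures)
```

Only candidates for which `gate` returns `True` would be written back. Returning the list of failed checks, rather than a bare boolean, keeps rejections auditable.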
But there's a catch that pushes back toward your premise. The strongest defense in the corpus is grounded refusal, where a noisy-source RAG system answers *only* what it can ground and declines the rest, and that is itself a change to the generation step, trading coverage for integrity (Can RAG systems refuse to answer without reliable evidence?). This matters because fabrication is partly baked into how generation works: token prediction flows smoothly toward the training distribution rather than stress-testing competing claims, so smooth, confident, unexamined assertions are the *default* output, not an aberration (Does LLM generation explore competing claims while producing text?). Treating model output as a subjective prior to be weighted, never as empirical evidence, is the framing that makes external verification coherent in the first place (Should we treat LLM outputs as real empirical data?).
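One way to picture "a prior to be weighted" is a Beta-Bernoulli update in which model-generated observations are discounted by an explicit trust weight. This is an illustrative assumption, not the Foundation Priors framework's actual formulation: the function name `posterior_mean`, the discounting scheme, and the uniform Beta(1, 1) default are all hypothetical choices.

```python
def posterior_mean(real_successes: int, real_trials: int,
                   llm_successes: int, llm_trials: int,
                   trust: float,
                   prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior mean of a rate when LLM 'observations' count at weight `trust`.

    trust = 0.0 ignores model output entirely; trust = 1.0 treats it as if it
    were real data, which is the mistake the weighting is meant to prevent.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be in [0, 1]")
    # Real evidence enters at full weight; model output is scaled by `trust`.
    a = prior_a + real_successes + trust * llm_successes
    b = (prior_b + (real_trials - real_successes)
         + trust * (llm_trials - llm_successes))
    return a / (a + b)
```

With 2 successes in 10 real trials and 90 in 100 model-generated ones, the estimate is 0.25 at `trust=0.0`, about 0.55 at `trust=0.1`, and about 0.83 at `trust=1.0`. The weight makes explicit how much the conclusion leans on text that was never checked against the world.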
The reason this isn't academic: fabrication is already industrializing. LLMs can auto-generate hundreds of complete finance papers with invented theory and fabricated citations (Can AI generate hundreds of fake academic papers automatically?), and human editing won't save us: writers edit AI text only 23% of the time, and those edits stay about 96% similar to the original, so distortions reach audiences essentially unchanged (Do writers actually edit AI-generated text before publishing?). So yes, you can verify without redesigning generation, but only if verification means an external, evidence-anchored checker, not a detector sniffing for "AI-ness." The cheapest detectors are exactly the ones attackers route around.
Sources (11 notes)
LLM-generated text differs significantly on six lexical diversity dimensions, confirmed through statistical analysis across multiple models. Yet human judges, including trained linguists, cannot reliably detect these differences—and newer models diverge further while becoming harder to spot.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.
interwhen automatically generates code-based verifiers—including provably correct Lean and z3 checkers—from prose policy documents. This inverts the usual neuro-symbolic division: the LLM both translates policy to formal logic and extracts verifier inputs from reasoning traces.
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Foundation Priors framework shows that LLM-generated text reflects the model's learned patterns and user's prompt choices, not ground truth. Such outputs should only influence inference through explicitly parameterized trust weights, not be treated as equivalent to real evidence.
A demonstration showed LLMs generating 288 complete finance papers from 96 statistically significant signals, each with invented theoretical justifications and fabricated citations, proving academic HARKing can be automated at scale.
Writers edited AI-generated paragraphs only 23% of the time, with edits averaging 96% similarity to the original. This means AI's opinionated and distorted voice propagates with minimal human filtering before publication.