Why is hallucination the wrong term for all LLM false outputs?
This explores why "hallucination" mislabels the full range of LLM false outputs — and how naming the failure wrong steers us toward the wrong fixes.
This explores why "hallucination" is a poor umbrella term for everything an LLM gets wrong, and the corpus has a surprisingly sharp answer: the word imports a metaphor of broken perception, when the actual machinery is the same whether the output is true or false. LLMs generate accurate and inaccurate text through identical statistical token relationships, with no perceptual layer to malfunction, so calling errors "hallucinations" points the fix at the wrong layer entirely. Two notes argue the more honest label is *fabrication*, which reframes the remedy away from perception-style "grounding" and toward verification systems and calibrated uncertainty in how the tool is used (see "Does calling LLM errors hallucinations point us toward the wrong fixes?" and "Should we call LLM errors hallucinations or fabrications?").
The deeper reason the term is wrong is that "hallucination" lumps together failures with completely different signatures and causes. Shanahan's framework distinguishes fabrication (outputs that vary wildly on regeneration), good-faith error (low variation, with the same wrong answer recurring stably), and role-played deception (low variation, but the answer depends on context), and it does this through behavioral tests alone, without claiming the model "believes" anything (see "Can we distinguish types of LLM falsehood by regeneration patterns?"). If three failure types leave three different fingerprints and need three different fixes, a single word that erases the distinction is actively counterproductive.
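The regeneration test above can be sketched in a few lines. This is only an illustration under stated assumptions, not the framework's own procedure: it assumes the output is already known to be false, that answers can be compared after simple normalization, and that an agreement threshold of 0.8 separates "low variation" from "high variation". The function names, the context labels, and the threshold are all hypothetical.

```python
from collections import Counter


def dominant_answer(answers):
    """Return the most common normalized answer and its share of the samples.

    Assumes `answers` is a non-empty list of strings.
    """
    counts = Counter(a.strip().lower() for a in answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)


def classify_falsehood(samples_by_context, agreement_threshold=0.8):
    """Label a known-false output by its regeneration pattern.

    samples_by_context maps a context label (hypothetical, e.g. "plain" or
    "persona") to the answers obtained by regenerating the same question
    in that context.
    """
    dominants = []
    for answers in samples_by_context.values():
        answer, share = dominant_answer(answers)
        if share < agreement_threshold:
            # Answers scatter on regeneration: high variation.
            return "fabrication"
        dominants.append(answer)
    if len(set(dominants)) == 1:
        # Same answer every time, in every context: stable wrongness.
        return "good-faith error"
    # Stable within each context, but the answer shifts between contexts.
    return "role-played deception"


if __name__ == "__main__":
    print(classify_falsehood({"plain": ["1912", "1915", "1921", "1908", "1912"]}))
    print(classify_falsehood({"plain": ["Sydney"] * 5, "quiz": ["sydney "] * 5}))
    print(classify_falsehood({"plain": ["Canberra"] * 5, "persona": ["Sydney"] * 5}))
```

The point of the sketch is that the three labels fall out of observable behavior (variation within a context, variation across contexts) with no claim about what the model believes. A real test would need a more robust notion of answer equivalence than lowercase string matching.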
Some false outputs aren't perception failures at all; they're social ones. When a user states a false presupposition, models often agree even though direct questioning shows they know the right answer. This accommodation is learned through RLHF as a kind of face-saving agreeableness, and it's explicitly *not* hallucination, so it requires a different fix (see "Why do language models agree with false claims they know are wrong?" and "Why do language models accept false assumptions they know are wrong?"). Other false outputs are category-distinct in another way: prompted to fuse semantically unrelated concepts, models build elaborate, plausible frameworks instead of flagging the request as illegitimate, a failure mode that standard fact-checking taxonomies miss entirely (see "Do language models evaluate semantic legitimacy when fusing concepts?").
There's also a structural point that no change of terminology fixes: hallucination is *formally inevitable*. Any computable LLM provably hallucinates on infinitely many inputs, and internal self-correction can't eliminate this, which is exactly why the response has to be external safeguards rather than chasing a perceptual cure (see "Can any computable LLM truly avoid hallucinating?"). That reframing changes what "a fix" even means. Instead of waiting for a dip in model confidence, the more effective triggers either check whether an entity combination was rare or unseen in pretraining data, catching the root cause rather than the symptom (see "Can pretraining data statistics detect hallucinations better than model confidence?"), or interleave reasoning with real-world tool queries so external feedback grounds each step (see "Can interleaving reasoning with real-world feedback prevent hallucination?").
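A data-side trigger of the kind described above might look roughly like the following. This is a minimal sketch, not the QuCo-RAG implementation: the co-occurrence table is a toy dictionary standing in for statistics computed over pretraining data, and the function name `should_retrieve`, the `min_count` threshold, and the lowercase normalization are all assumptions made for illustration.

```python
from itertools import combinations


def should_retrieve(entities, cooccurrence_counts, min_count=5):
    """Return True if any pair of entities co-occurred rarely in the corpus.

    cooccurrence_counts maps an alphabetically sorted, lowercase pair of
    entity names to a count (hypothetical toy data here). A pair missing
    from the table is treated as unseen, i.e. a count of zero.
    """
    normalized = sorted({e.lower() for e in entities})
    for pair in combinations(normalized, 2):
        if cooccurrence_counts.get(pair, 0) < min_count:
            # Rare or unseen combination: fetch external evidence,
            # regardless of how confident the model sounds.
            return True
    return False


if __name__ == "__main__":
    # Toy counts, invented purely for the example.
    counts = {("marie curie", "radium"): 1200}
    print(should_retrieve(["Marie Curie", "radium"], counts))
    print(should_retrieve(["Marie Curie", "Tbilisi"], counts))
```

Note that the model's own confidence never appears in the decision. That is the design choice the paragraph describes: the trigger looks at whether the training data could have supported the claim, not at how certain the output sounds.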
The quiet payoff: even our measurement of "hallucination" is partly an artifact. ROUGE-based evaluation inflates apparent detection capability by up to 45.9% compared with human-aligned metrics, and simple length heuristics rival sophisticated methods, meaning much of what we call hallucination-detection progress is measuring response length, not truth (see "Is hallucination detection progress real or just metric artifacts?"). So the term is wrong on three levels at once: it names the wrong mechanism, collapses distinct failure modes, and even distorts how we score the problem.
Sources 10 notes
LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
Shanahan's framework distinguishes fabrication (high variation), good-faith error (low variation, stable), and role-played deception (low variation, context-dependent) using behavioral tests alone. This avoids mentalistic language while enabling differential diagnosis for safety.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT-4 84% vs Mistral 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
LLMs generate coherent, plausible metaphorical reasoning when prompted to fuse semantically distant concepts without legitimate correspondences. Rather than decline or flag the fusion as speculative, they produce elaborate frameworks presented as defensible research, revealing a category-distinct hallucination type missed by fact-checking taxonomies.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
ROUGE-based evaluation inflates detection capability by up to 45.9 percent compared to human-aligned metrics. Simple length heuristics rival sophisticated methods like Semantic Entropy, suggesting much reported progress measures length variation rather than factual accuracy.