What distinguishes intrinsic hallucination from extrinsic hallucination patterns?

This explores the classic split between hallucinations that contradict a given source (intrinsic) and those that add facts the source can't support (extrinsic) — but the corpus mostly reframes that surface taxonomy in terms of what causes each pattern.

The textbook distinction is about where the error lives relative to the input: an *intrinsic* hallucination contradicts the material the model was given, while an *extrinsic* one invents content that simply isn't in the source and can't be checked against it. Worth saying up front: the collection doesn't dwell on those two labels directly. What it does instead is more useful — it locates the *mechanisms* behind each pattern, which is where the distinction actually pays off.
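To make the textbook definitions concrete, here is a minimal sketch that labels a generated claim relative to a source. The `(subject, relation) -> value` representation, the `classify_claim` helper, and the example facts are illustrative assumptions, not something taken from the collection; real faithfulness checkers work on free text and are far less tidy.

```python
from typing import Dict, Tuple

# Assumption: a source is modelled as (subject, relation) -> value.
Source = Dict[Tuple[str, str], str]

def classify_claim(source: Source, subject: str, relation: str, value: str) -> str:
    """Label one generated claim relative to the source it was conditioned on."""
    key = (subject, relation)
    if key not in source:
        # The source is silent on this fact, so the claim cannot be checked against it.
        return "extrinsic"
    if source[key] == value:
        return "supported"
    # The source covers this fact and states something else.
    return "intrinsic"

source = {("Curie", "born_in"): "Warsaw", ("Curie", "field"): "physics"}
labels = [
    classify_claim(source, "Curie", "born_in", "Warsaw"),
    classify_claim(source, "Curie", "born_in", "Paris"),
    classify_claim(source, "Curie", "spouse", "Pierre"),
]
print(labels)  # ['supported', 'intrinsic', 'extrinsic']
```

Note that the third claim is labelled extrinsic even though it may be true in the world: the label records only that this particular source cannot support it.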

The cleanest mechanistic version of the split comes from how models represent familiar vs. unfamiliar material. Networks build dense, confident activations for things they saw often in training and fall back to sparse representations for inputs they don't recognize [Is representational sparsity learned or intrinsic to neural networks?]. That maps neatly onto the extrinsic case: when a model is asked about a rare entity or an unseen *combination* of entities, it has no grounded representation to draw on and confabulates to fill the gap. A complementary note shows you can predict this from the data side: entity co-occurrence statistics in the pretraining corpus flag hallucination risk even when the model reports high confidence, catching the root cause (the combination was never seen) rather than the symptom (low confidence) [Can pretraining data statistics detect hallucinations better than model confidence?]. Models even carry an internal 'do I know this entity?' signal that steers them toward either answering or refusing [Do models know what they don't know?].
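The data-side idea can be sketched in a few lines: count how often pairs of entities appear together in a corpus, then flag any pair in a query that was rarely or never seen together. This is a toy illustration under stated assumptions, not the QuCo-RAG method itself; the `build_cooccurrence` and `risky_pairs` helpers, the `min_count` threshold, and the three-document corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

def build_cooccurrence(docs: Iterable[Set[str]]) -> Counter:
    """Count how many documents mention each unordered pair of entities."""
    counts: Counter = Counter()
    for entities in docs:
        # Sorting gives every unordered pair one canonical key.
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts

def risky_pairs(counts: Counter, query_entities: Set[str],
                min_count: int = 1) -> List[Tuple[str, str]]:
    """Return entity pairs in the query seen together fewer than min_count times."""
    return [
        pair
        for pair in combinations(sorted(query_entities), 2)
        if counts[pair] < min_count  # Counter returns 0 for unseen pairs
    ]

# Assumption: entities have already been extracted from each document.
corpus = [
    {"Marie Curie", "Warsaw"},
    {"Marie Curie", "radium"},
    {"Alan Turing", "Enigma"},
]
counts = build_cooccurrence(corpus)

print(risky_pairs(counts, {"Marie Curie", "radium"}))  # []
print(risky_pairs(counts, {"Marie Curie", "Enigma"}))  # [('Enigma', 'Marie Curie')]
```

The check never consults the model's confidence, which is the point: both entities in the second query are individually familiar, and only their combination is unseen.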

Where this gets interesting is that the corpus argues the intrinsic/extrinsic taxonomy isn't fine-grained enough. One note isolates a third pattern entirely: *prompt-induced* hallucination, where a model is asked to fuse two semantically distant concepts and, rather than flag the fusion as illegitimate, produces an elaborate, plausible framework presented as real research [Do language models evaluate semantic legitimacy when fusing concepts?]. That isn't contradicting a source (not intrinsic) and isn't quite inventing a fact (not classically extrinsic); it's a failure to evaluate whether a request is even coherent. Fact-checking taxonomies built around source-faithfulness miss it completely.

A stronger line in the collection questions the whole framing. Two notes argue these are all *fabrications*, not hallucinations, because accurate and inaccurate outputs come from the identical statistical token-prediction process, with no perception or memory step that 'goes wrong' [Should we call LLM errors hallucinations or fabrications?] [Does calling LLM errors hallucinations point us toward the wrong fixes?]. From that angle, intrinsic vs. extrinsic describes *where the output lands relative to a reference*, not two different things happening inside the model. That distinction matters for fixes: if you think it's a perception error you reach for grounding; if you accept it's fabrication you reach for verification and calibrated uncertainty.

The practical upshot, then, splits by which pattern you're fighting. Extrinsic-style fabrication (inventing unsupported content) is the one external grounding addresses well: interleaving reasoning with real-world lookups (a Wikipedia query, a tool call) injects ground truth at each step and cuts error propagation sharply [Can interleaving reasoning with real-world feedback prevent hallucination?]. Intrinsic contradictions, by contrast, are harder to grind out from inside the model, and one note proves the ceiling is real: hallucination is formally inevitable for any computable LLM, so no internal mechanism fully eliminates it and external safeguards aren't optional [Can any computable LLM truly avoid hallucinating?]. The less obvious takeaway: the intrinsic/extrinsic line is less a property of the error and more a choice of which reference you're measuring against, and that choice silently decides whether you'll try to fix it with grounding or with verification.
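The error-propagation point about interleaved lookups can be shown with a toy two-hop question. This is a sketch under heavy assumptions, not ReAct itself: `model_guess` is a hypothetical stand-in for ungrounded recall (deliberately wrong on the first hop), and `KB` and `lookup` are a hypothetical stand-in for an external tool such as a Wikipedia query.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a tiny lookup table stands in for an external knowledge source.
KB: Dict[Tuple[str, str], str] = {
    ("Australia", "capital"): "Canberra",
    ("Canberra", "region"): "Australian Capital Territory",
    ("Sydney", "region"): "New South Wales",
}

def lookup(subject: str, relation: str) -> Optional[str]:
    """Hypothetical tool call: return the recorded value, or None if unknown."""
    return KB.get((subject, relation))

def model_guess(subject: str, relation: str) -> str:
    """Hypothetical ungrounded recall; the first-hop answer is deliberately wrong."""
    guesses = {
        ("Australia", "capital"): "Sydney",
        ("Sydney", "region"): "New South Wales",
        ("Canberra", "region"): "Australian Capital Territory",
    }
    return guesses[(subject, relation)]

def two_hop(start: str, relations: List[str], use_tool: bool) -> str:
    """Follow a chain of relations, optionally checking each hop against the tool."""
    current = start
    for relation in relations:
        guess = model_guess(current, relation)
        observed = lookup(current, relation) if use_tool else None
        # An observation overrides the guess before the next hop builds on it.
        current = observed if observed is not None else guess
    return current

print(two_hop("Australia", ["capital", "region"], use_tool=False))  # New South Wales
print(two_hop("Australia", ["capital", "region"], use_tool=True))   # Australian Capital Territory
```

Without the tool, the second hop is answered correctly for the wrong city, so the final answer inherits the first-hop mistake. With a lookup at each step, the mistake is corrected before anything builds on it.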

Sources 8 notes

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Do language models evaluate semantic legitimacy when fusing concepts?

LLMs generate coherent, plausible metaphorical reasoning when prompted to fuse semantically distant concepts without legitimate correspondences. Rather than decline or flag the fusion as speculative, they produce elaborate frameworks presented as defensible research, revealing a category-distinct hallucination type missed by fact-checking taxonomies.

Should we call LLM errors hallucinations or fabrications?

LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.

Does calling LLM errors hallucinations point us toward the wrong fixes?

LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive tasks it reduces the hallucination and error propagation seen in pure chain-of-thought, and on interactive decision-making benchmarks it outperforms imitation and reinforcement learning baselines by 10 to 34 percentage points of absolute success rate.

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
