Why does model confidence fail to detect hallucinations on rare entity pairs?

This explores why a model's own confidence is a poor alarm for hallucinations involving rare entity combinations — and what the corpus suggests we use instead.

The corpus points to a clean answer: confidence and rarity measure two different things. The most direct evidence is the finding that internal uncertainty signals and pretraining-rarity signals catch *orthogonal* failure modes: confidence reliably flags shaky reasoning about common knowledge, but goes quiet exactly when the model confronts a combination of entities it rarely or never saw together in training (see "Should RAG systems use model confidence or data rarity to trigger retrieval?"). The model isn't uncertain about the rare pair; it's confidently wrong, because nothing in its experience contradicts the fabrication.

The deeper reason surfaces in the reframing of what LLMs actually do. Accurate and inaccurate outputs are produced by the *identical* statistical token-prediction process; there's no separate 'truth-tracking' circuit that fires for facts and falters for fabrications (see "Should we call LLM errors hallucinations or fabrications?" and "Does calling LLM errors hallucinations point us toward the wrong fixes?"). Confidence reflects how smoothly the next token fits the learned distribution, not whether the claim corresponds to reality. For a rare entity pair, a plausible-sounding bridge between two entities can be highly probable token-by-token while being entirely invented: high fluency, high confidence, zero grounding.
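To make the point concrete, here is a minimal sketch of one common token-level confidence score, the geometric mean of per-token probabilities. The function name `sequence_confidence` and the probabilities are hypothetical, chosen only for illustration; the notes above do not say which confidence measure any particular system uses.

```python
import math

def sequence_confidence(token_probs):
    """Geometric mean of per-token probabilities (exp of the mean log-probability).

    The only input is how likely each token was under the model. Nothing here
    encodes whether the resulting claim is true.
    """
    log_probs = [math.log(p) for p in token_probs]
    return math.exp(sum(log_probs) / len(log_probs))

# Hypothetical per-token probabilities for a fluent but invented sentence.
fluent_but_invented = [0.91, 0.88, 0.93, 0.90]
print(round(sequence_confidence(fluent_but_invented), 3))
```

A fabricated bridge between two entities scores high as long as each token fits the learned distribution, which is the failure described above.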

This is why the corpus argues for moving the detection signal *off* the model and onto the data. QuCo-RAG uses entity co-occurrence statistics from the training corpus to trigger retrieval, successfully flagging risk on unseen combinations even when the model reports high confidence. It catches the root cause (the combination was never seen) rather than the symptom (the model feels unsure) (see "Can pretraining data statistics detect hallucinations better than model confidence?"). Rarity is a property the model can't introspect about, so an external, data-side check sees what self-assessment structurally cannot.
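The idea of a data-side trigger can be sketched in a few lines. This is not QuCo-RAG's implementation: the function names, the `min_count` threshold, and the toy corpus are all assumptions made for illustration, and a real system would count co-occurrences over the pretraining corpus after an entity-extraction step that is omitted here.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(docs):
    """Count how many documents mention each unordered entity pair."""
    counts = Counter()
    for entities in docs:
        # Sort so that (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

def should_retrieve(query_entities, counts, min_count=1):
    """Trigger retrieval if any entity pair in the query is seen fewer than min_count times."""
    pairs = combinations(sorted(set(query_entities)), 2)
    return any(counts[pair] < min_count for pair in pairs)

# Toy corpus: each document is reduced to the entities it mentions (hypothetical data).
corpus = [
    ["Marie Curie", "Sorbonne", "radium"],
    ["Marie Curie", "Nobel Prize"],
    ["Niels Bohr", "Nobel Prize", "Copenhagen"],
]
counts = build_cooccurrence(corpus)

print(should_retrieve(["Marie Curie", "Nobel Prize"], counts))  # pair is attested
print(should_retrieve(["Marie Curie", "Copenhagen"], counts))   # pair never co-occurs
```

Note that the model's confidence never enters `should_retrieve`; the decision depends only on what the data contains.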

Worth knowing: the limits here aren't just engineering gaps. Hallucination is formally inevitable for any computable LLM, and internal mechanisms like self-correction provably can't eliminate it, which is precisely why external safeguards like rarity-triggered retrieval aren't optional add-ons but necessary (see "Can any computable LLM truly avoid hallucinating?"). Meanwhile, confidence-based detectors aren't useless: semantic entropy, which clusters multiple sampled answers by meaning rather than reading raw token probability, catches confabulations that token-level confidence misses (see "Can we detect when language models confabulate?"). But even that operates on the model's behavior, not on whether the underlying entity pair was ever attested.
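A rough sketch of the semantic-entropy idea follows. It is simplified in two ways that are assumptions of this example: cluster probabilities are estimated from sample counts, and `normalized_match` is a hypothetical stand-in for the bidirectional-entailment check, which in practice would be done with an entailment model.

```python
import math

def cluster_by_meaning(answers, same_meaning):
    """Greedily group answers; an answer joins the first cluster whose representative it matches."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters

def semantic_entropy(answers, same_meaning):
    """Entropy over meaning clusters, with cluster probability estimated by sample frequency."""
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)

def normalized_match(a, b):
    # Placeholder equivalence test: case- and punctuation-insensitive string equality.
    # A real check would ask whether a entails b AND b entails a.
    return a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")

consistent = ["Paris", "paris.", "Paris"]   # one meaning, different surface forms
scattered = ["Paris", "Lyon", "Rome"]       # three different meanings
print(semantic_entropy(consistent, normalized_match))
print(semantic_entropy(scattered, normalized_match))
```

Answers that differ only in wording collapse into one cluster and give near-zero entropy, while answers that disagree in meaning spread across clusters and give high entropy. Either way, the score is computed from the model's own samples, so it says nothing about whether the entity pair appears in the training data.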

The takeaway a curious reader might not expect: the fix isn't a better confidence meter. The most robust systems hybridize, using internal uncertainty for the 'uncertain reasoning about common facts' failures and external data-rarity for the 'confidently wrong about rare combinations' failures, because neither signal alone covers the space (see "Should RAG systems use model confidence or data rarity to trigger retrieval?").
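One way to picture a hybrid trigger is a simple OR over the two signals. This is a sketch under stated assumptions, not a description of any specific system: the `Signals` fields, the function name, and both thresholds are hypothetical and would need tuning on real data.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    uncertainty: float    # internal signal, e.g. an entropy over sampled answers
    min_pair_count: int   # external signal: lowest co-occurrence count among the query's entity pairs

def hybrid_trigger(signals, max_uncertainty=0.5, min_count=5):
    """Return the reasons to retrieve; an empty list means no retrieval is triggered."""
    reasons = []
    if signals.uncertainty > max_uncertainty:
        reasons.append("uncertain_reasoning")
    if signals.min_pair_count < min_count:
        reasons.append("rare_entity_pair")
    return reasons

# Confident model, unseen pair: only the data-side check fires.
print(hybrid_trigger(Signals(uncertainty=0.1, min_pair_count=0)))
# Unsure model, common pair: only the internal check fires.
print(hybrid_trigger(Signals(uncertainty=0.9, min_pair_count=120)))
```

Returning the reasons rather than a bare boolean keeps the two failure modes distinguishable downstream, which matters because they call for different evidence.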

Sources (6 notes)

Should RAG systems use model confidence or data rarity to trigger retrieval?

Model confidence and data-rarity signals catch orthogonal failure modes: confidence misses hallucinations about rare entities, while rarity misses uncertain reasoning about common knowledge. Hybrid triggers substantially outperform either signal alone.

Should we call LLM errors hallucinations or fabrications?

LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.

Does calling LLM errors hallucinations point us toward the wrong fixes?

LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.

Can we detect when language models confabulate?

Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
