Why does model confidence fail to detect hallucinations about rare entities?
This explores a specific blind spot: why a model's own confidence signal stays high, and is therefore useless as a tripwire, exactly when the model is making things up about obscure entities the training data barely covered.
The confidence signal goes quiet precisely where you'd most want it to fire: on rare entities the model barely saw during training. The corpus's sharpest answer is that confidence and rarity catch *orthogonal* failure modes. As "Should RAG systems use model confidence or data rarity to trigger retrieval?" puts it, model confidence misses hallucinations about rare entities, while data rarity misses uncertain reasoning about common knowledge: two different leaks, two different sensors. A model can be fluent and assured about an entity it has almost no grounding for, because confidence measures how smoothly the next token follows, not whether the underlying fact was ever learned.
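A minimal sketch of what pairing the two sensors could look like. Everything here is an assumption for illustration: the `Signals` fields, the `should_retrieve` name, and both thresholds are hypothetical, and how each signal is actually measured is left open.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-query signals; names and scales are assumptions."""
    confidence: float  # model's self-reported confidence, in [0, 1]
    entity_count: int  # how often the entity appears in a reference corpus


def should_retrieve(s: Signals, min_confidence: float = 0.7, min_count: int = 50) -> bool:
    """Trigger retrieval if EITHER sensor fires (thresholds are illustrative)."""
    low_confidence = s.confidence < min_confidence  # catches shaky reasoning
    rare_entity = s.entity_count < min_count        # catches confident guesses about rare entities
    return low_confidence or rare_entity
```

The `or` is the point: a confident answer about a rare entity still triggers retrieval, and so does an unsure answer about a common one.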
The reason confidence is the wrong instrument here becomes clearer when you look at where the failure actually originates. "Can pretraining data statistics detect hallucinations better than model confidence?" argues the root cause is unseen *combinations*, meaning an entity that rarely co-occurred with the attribute being asked about, while low confidence is only a downstream symptom that often never shows up. The QuCo-RAG approach described there sidesteps the model's self-report entirely and reads the training data's co-occurrence statistics, flagging risk even when the model is highly confident. The lesson is that the answer lives on the data side, not in the model's introspection.
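To make the data-side idea concrete, here is a toy sketch of a co-occurrence check. It is not the QuCo-RAG implementation: the function names, the substring matching, the threshold, and the three-sentence corpus are all made up for illustration, and a real system would query statistics over the actual pretraining data.

```python
def cooccurrence_count(corpus: list[str], entity: str, attribute: str) -> int:
    """Count documents that mention both the entity and the attribute.

    Uses naive case-insensitive substring matching, which is an assumption
    made to keep the sketch short.
    """
    e, a = entity.lower(), attribute.lower()
    return sum(1 for doc in corpus if e in doc.lower() and a in doc.lower())


def flag_risky(corpus: list[str], entity: str, attribute: str, min_cooccurrence: int = 1) -> bool:
    """Flag a question as risky when the entity-attribute pair is (nearly) unseen."""
    return cooccurrence_count(corpus, entity, attribute) < min_cooccurrence


# Toy corpus, invented for this example.
corpus = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won the Nobel Prize in Physics.",
    "Warsaw is the capital of Poland.",
]
seen_pair_is_risky = flag_risky(corpus, "Marie Curie", "born")
unseen_pair_is_risky = flag_risky(corpus, "Marie Curie", "died")
```

Note that the check never consults the model: the pair ("Marie Curie", "died") is flagged purely because the corpus never put the two together, however confident a model might sound about it.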
There's a deeper structural reason confidence can't be trusted as the sole detector. "Should we call LLM errors hallucinations or fabrications?" points out that correct and incorrect outputs run through the *identical* statistical machinery: nothing internally distinguishes a grounded answer from a fabricated one, so there's no privileged 'I'm unsure' signal to read off for rare cases. And "Can any computable LLM truly avoid hallucinating?" proves that internal mechanisms, including self-assessment, can't eliminate hallucination on their own, which is exactly why external safeguards (data statistics, retrieval) become necessary rather than optional.
This doesn't mean confidence-style signals are worthless, just that they detect a different thing. "Can we detect when language models confabulate?" shows you can recover real uncertainty by sampling many answers and measuring how much their *meanings* diverge, which catches confabulations invisible at the single-token confidence level. That's still an internal signal, but a smarter one: it surfaces the cases where the model would answer inconsistently if asked again. The rare-entity problem is partly that a model can be consistently wrong, confidently and repeatably, so even semantic entropy benefits from being paired with the data-rarity trigger.
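A rough sketch of the entropy-over-meanings idea, under stated assumptions. The `equivalent` callable stands in for the bidirectional entailment check and is supplied by the caller; the `same_normalized` helper below is a crude hypothetical substitute (normalized string match), not an entailment model, and the greedy clustering is a simplification.

```python
import math
from typing import Callable


def semantic_entropy(answers: list[str], equivalent: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) over clusters of answers that share a meaning."""
    clusters: list[list[str]] = []
    for ans in answers:
        # Greedy clustering: join the first cluster whose representative matches.
        for cluster in clusters:
            if equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def same_normalized(a: str, b: str) -> bool:
    """Hypothetical stand-in for entailment: case- and punctuation-insensitive match."""
    def norm(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return norm(a) == norm(b)
```

Sampled answers that scatter across meanings, such as `["Paris", "Lyon", "Rome", "Paris"]`, give high entropy. Answers that all agree give zero entropy whether they are right or wrong, which is the consistently-wrong blind spot described above.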
If there's one unexpected thing to take away: the fix isn't a better confidence meter, it's looking outside the model. Combine the internal uncertainty signal with an external rarity signal and you cover both leaks. It's also worth knowing that much of the apparent 'progress' in detecting hallucinations may be a mirage. "Is hallucination detection progress real or just metric artifacts?" found that simple answer-length heuristics rival sophisticated detectors under the usual metrics, so before trusting any single signal, check whether it's measuring factual accuracy or just text length.
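One way to run that sanity check, sketched under assumptions: score each answer by its word count alone, compute the AUROC against hallucination labels, and compare it with the AUROC of the detector you are evaluating on the same examples. The function names are hypothetical, and the labels (1 = hallucinated, 0 = correct) are assumed to come from human-aligned annotation.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def length_baseline_auroc(answers: list[str], labels: list[int]) -> float:
    """AUROC obtained by using word count alone as the hallucination score."""
    return auroc([float(len(a.split())) for a in answers], labels)
```

If `length_baseline_auroc` lands close to the detector's own AUROC, the detector may be tracking answer length rather than factual accuracy.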
Sources (6 notes)
Model confidence and data-rarity signals catch orthogonal failure modes: confidence misses hallucinations about rare entities, while rarity misses uncertain reasoning about common knowledge. Hybrid triggers substantially outperform either signal alone.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
ROUGE-based evaluation inflates detection capability by up to 45.9 percent compared to human-aligned metrics. Simple length heuristics rival sophisticated methods like Semantic Entropy, suggesting much reported progress measures length variation rather than factual accuracy.