Can we detect when language models confabulate?
Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
Standard entropy estimation for LLM outputs is misleading because the same correct answer can be expressed in many syntactically different ways, inflating apparent uncertainty. Semantic entropy solves this by operating at the level of meaning rather than tokens.
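A toy illustration of the inflation problem (the answers and the surface-form counting are illustrative simplifications, not the paper's exact estimator): treating every distinct string as its own outcome yields high entropy even when all samples mean the same thing, while entropy over meaning clusters collapses to zero.

```python
import math

# Three sampled answers that mean the same thing but are worded differently.
samples = ["Paris", "The capital of France is Paris", "It's Paris"]

# Surface-form entropy: each string is distinct, so the empirical
# distribution is uniform over three outcomes.
surface_entropy = -sum((1 / 3) * math.log(1 / 3) for _ in samples)  # ~1.10 nats

# Meaning-level entropy: all three answers fall into one semantic cluster,
# so a single outcome carries probability 1.
meaning_entropy = -1.0 * math.log(1.0)  # 0.0 nats

print(f"surface: {surface_entropy:.2f} nats, semantic: {meaning_entropy:.2f} nats")
```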
The method: sample multiple answers to a question, cluster them by bidirectional entailment (if A entails B and B entails A, they share a semantic cluster), then compute entropy over the clusters. High semantic entropy — many incompatible meaning clusters — signals confabulation. Low semantic entropy — answers converge on the same meaning despite different wording — signals reliability.
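A minimal sketch of that pipeline, assuming an off-the-shelf NLI checkpoint from Hugging Face for the entailment check. The model name, the greedy clustering, and comparing answers without the question text are simplifications for illustration, not the paper's reference implementation.

```python
import math

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "microsoft/deberta-large-mnli"  # assumed off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME)


def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model's top label for (premise, hypothesis) is entailment."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label = nli_model.config.id2label[int(logits.argmax(dim=-1))]
    return label.lower() == "entailment"


def semantic_clusters(answers: list[str]) -> list[list[str]]:
    """Greedy clustering by bidirectional entailment: two answers share a
    cluster only if each entails the other."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            representative = cluster[0]  # compare against one member per cluster
            if entails(representative, ans) and entails(ans, representative):
                cluster.append(ans)
                break
        else:  # no existing cluster matched: start a new one
            clusters.append([ans])
    return clusters


def semantic_entropy(answers: list[str]) -> float:
    """Entropy over meaning clusters, estimating each cluster's probability
    as its share of the sampled answers (a discrete approximation)."""
    clusters = semantic_clusters(answers)
    probs = [len(cluster) / len(answers) for cluster in clusters]
    return -sum(p * math.log(p) for p in probs)
```

On paraphrases of one correct answer this collapses to a single cluster and entropy near zero; on mutually incompatible answers each lands in its own cluster and the entropy approaches the log of the number of samples.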
Key properties:
- Works across datasets and tasks without a priori knowledge of the task
- Requires no task-specific data
- Robustly generalizes to unseen tasks
- Improves question-answering accuracy when used for selective answering: answer when semantic entropy is low, abstain when it is high (see the sketch after this list)
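A hedged sketch of that selective-answering use, building on the helpers defined above; `sample_answers` and the threshold value are hypothetical placeholders for your own sampler and a tuned cutoff.

```python
from collections import Counter


def answer_or_abstain(question: str, n_samples: int = 10,
                      entropy_threshold: float = 0.5) -> str | None:
    """Answer only when the sampled answers agree in meaning; otherwise abstain."""
    answers = sample_answers(question, n_samples)  # hypothetical: n samples from the model
    if semantic_entropy(answers) > entropy_threshold:
        return None  # high semantic entropy: likely confabulation, so abstain
    # Low entropy: return the most common phrasing from the largest meaning cluster.
    largest = max(semantic_clusters(answers), key=len)
    return Counter(largest).most_common(1)[0][0]
```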
The paper draws a precise distinction: not all hallucinations are confabulations. Confabulations are "arbitrary and incorrect generations": outputs sensitive to irrelevant factors such as the sampling seed, where the model could just as easily have produced a different, incompatible answer. Semantic entropy detects this specific failure mode: inconsistency at the meaning level.
This is practically valuable because it is self-referential — the model's own output distribution provides the uncertainty signal, requiring no external ground truth. When a model confabulates, it typically does so inconsistently across samples: different runs produce semantically incompatible answers. This inconsistency, invisible at the token level, becomes measurable at the semantic level.
Source: MechInterp
Related concepts in this collection
- Does calling LLM errors hallucinations point us toward the wrong fixes? Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts; the terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem. Connection: semantic entropy operationalizes the detection of one class of fabrication, semantically inconsistent generation.
- Can model confidence work as a reward signal for reasoning? Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages. Connection: semantic entropy is an alternative confidence signal; both use self-referential measures, but semantic entropy operates over sampled outputs rather than internal probabilities.
- Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization. Connection: calibration and confabulation detection are related; well-calibrated models should have lower semantic entropy on questions they answer correctly.
Original note title: semantic entropy detects confabulations by computing uncertainty over meanings rather than tokens