Can we detect when language models confabulate?
Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
Standard entropy estimation for LLM outputs is misleading because the same correct answer can be expressed in many syntactically different ways, inflating apparent uncertainty. Semantic entropy solves this by operating at the level of meaning rather than tokens.
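A toy illustration of the inflation problem (the answers and the surface-form counting are illustrative simplifications, not the paper's exact estimator): treating every distinct string as its own outcome yields high entropy even when all samples mean the same thing, while entropy over meaning clusters collapses to zero.

```python
import math

# Three sampled answers that mean the same thing but are worded differently.
samples = ["Paris", "The capital of France is Paris", "It's Paris"]

# Surface-form entropy: each string is distinct, so the empirical
# distribution is uniform over three outcomes.
surface_entropy = -sum((1 / 3) * math.log(1 / 3) for _ in samples)  # ~1.10 nats

# Meaning-level entropy: all three answers fall into one semantic cluster,
# so a single outcome carries probability 1.
meaning_entropy = -1.0 * math.log(1.0)  # 0.0 nats

print(f"surface: {surface_entropy:.2f} nats, semantic: {meaning_entropy:.2f} nats")
```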
The method: sample multiple answers to a question, cluster them by bidirectional entailment (if A entails B and B entails A, they share a semantic cluster), then compute entropy over the clusters. High semantic entropy — many incompatible meaning clusters — signals confabulation. Low semantic entropy — answers converge on the same meaning despite different wording — signals reliability.
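A minimal sketch of that pipeline, assuming an off-the-shelf NLI checkpoint from Hugging Face for the entailment check. The model name, the greedy clustering, and comparing answers without the question text are simplifications for illustration, not the paper's reference implementation.

```python
import math

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "microsoft/deberta-large-mnli"  # assumed off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME)


def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model's top label for (premise, hypothesis) is entailment."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label = nli_model.config.id2label[int(logits.argmax(dim=-1))]
    return label.lower() == "entailment"


def semantic_clusters(answers: list[str]) -> list[list[str]]:
    """Greedy clustering by bidirectional entailment: two answers share a
    cluster only if each entails the other."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            representative = cluster[0]  # compare against one member per cluster
            if entails(representative, ans) and entails(ans, representative):
                cluster.append(ans)
                break
        else:  # no existing cluster matched: start a new one
            clusters.append([ans])
    return clusters


def semantic_entropy(answers: list[str]) -> float:
    """Entropy over meaning clusters, estimating each cluster's probability
    as its share of the sampled answers (a discrete approximation)."""
    clusters = semantic_clusters(answers)
    probs = [len(cluster) / len(answers) for cluster in clusters]
    return -sum(p * math.log(p) for p in probs)
```

On paraphrases of one correct answer this collapses to a single cluster and entropy near zero; on mutually incompatible answers each lands in its own cluster and the entropy approaches the log of the number of samples.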
Key properties:
- Works across datasets and tasks without a priori knowledge of the task
- Requires no task-specific data
- Robustly generalizes to unseen tasks
- Improves question-answering accuracy when used for selective answering: answer when semantic entropy is low, abstain when it is high (see the sketch after this list)
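A hedged sketch of that selective-answering use, building on the helpers defined above; `sample_answers` and the threshold value are hypothetical placeholders for your own sampler and a tuned cutoff.

```python
from collections import Counter


def answer_or_abstain(question: str, n_samples: int = 10,
                      entropy_threshold: float = 0.5) -> str | None:
    """Answer only when the sampled answers agree in meaning; otherwise abstain."""
    answers = sample_answers(question, n_samples)  # hypothetical: n samples from the model
    if semantic_entropy(answers) > entropy_threshold:
        return None  # high semantic entropy: likely confabulation, so abstain
    # Low entropy: return the most common phrasing from the largest meaning cluster.
    largest = max(semantic_clusters(answers), key=len)
    return Counter(largest).most_common(1)[0][0]
```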
The paper draws a precise distinction: not all hallucinations are confabulations. Confabulations are "arbitrary and incorrect generations": outputs sensitive to irrelevant factors such as the sampling seed, where the model could just as easily have produced a different, incompatible answer. Semantic entropy detects this specific failure mode: inconsistency at the meaning level.
This is practically valuable because it is self-referential — the model's own output distribution provides the uncertainty signal, requiring no external ground truth. When a model confabulates, it typically does so inconsistently across samples: different runs produce semantically incompatible answers. This inconsistency, invisible at the token level, becomes measurable at the semantic level.
Source: MechInterp
Related concepts in this collection
- Does calling LLM errors hallucinations point us toward the wrong fixes? Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts; the terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem. Connection: semantic entropy operationalizes the detection of one class of fabrication, semantically inconsistent generation.
- Can model confidence work as a reward signal for reasoning? Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages. Connection: semantic entropy is an alternative confidence signal; both use self-referential measures, but semantic entropy operates over sampled outputs rather than internal probabilities.
- Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization. Connection: calibration and confabulation detection are related; well-calibrated models should have lower semantic entropy on questions they answer correctly.
Original note title: semantic entropy detects confabulations by computing uncertainty over meanings rather than tokens