Language Understanding and Pragmatics · LLM Reasoning and Architecture

Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?

Note · 2026-02-23 · sourced from MechInterp

Standard entropy estimation for LLM outputs is misleading because the same correct answer can be expressed in many syntactically different ways, inflating apparent uncertainty. Semantic entropy solves this by operating at the level of meaning rather than tokens.
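
To make the inflation concrete, here is a minimal sketch with invented samples: three paraphrases of the same answer count as three distinct outcomes under naive string-level entropy, even though there is no real uncertainty.

```python
import math

# Three syntactically different samples that all express the same answer.
samples = ["Paris.", "It's Paris.", "The capital of France is Paris."]

# Naive entropy over distinct strings treats each wording as its own outcome.
p = 1.0 / len(samples)
print(-sum(p * math.log(p) for _ in samples))  # log(3) ~ 1.10 nats of apparent uncertainty

# Semantic entropy would place all three in one meaning cluster: entropy 0.
```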

The method: sample multiple answers to a question, cluster them by bidirectional entailment (if A entails B and B entails A, they share a semantic cluster), then compute entropy over the clusters. High semantic entropy — many incompatible meaning clusters — signals confabulation. Low semantic entropy — answers converge on the same meaning despite different wording — signals reliability.
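
A minimal sketch of the discrete, count-based variant of this estimator, assuming an off-the-shelf NLI model (microsoft/deberta-large-mnli here) for the entailment checks. The model choice, the question-prefixing detail, and the greedy clustering order are implementation assumptions; the full method also weights clusters by sequence probabilities, which this sketch omits.

```python
import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # assumed NLI checker; any NLI model fits
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model labels premise -> hypothesis as ENTAILMENT."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        label_id = nli(**inputs).logits.argmax(-1).item()
    return nli.config.id2label[label_id].upper() == "ENTAILMENT"

def semantic_clusters(question: str, answers: list[str]) -> list[list[str]]:
    """Greedily cluster answers by bidirectional entailment, in question context."""
    clusters: list[list[str]] = []
    for ans in answers:
        a = f"{question} {ans}"
        for cluster in clusters:
            rep = f"{question} {cluster[0]}"
            # Same meaning iff each statement entails the other.
            if entails(rep, a) and entails(a, rep):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def discrete_semantic_entropy(question: str, answers: list[str]) -> float:
    """Entropy over meaning clusters, estimated from cluster occupancy counts."""
    clusters = semantic_clusters(question, answers)
    n = len(answers)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)
```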

Key properties:

- Not all hallucinations are confabulations. The paper draws a precise distinction: confabulations are "arbitrary and incorrect generations", outputs where the model could just as easily have produced different and incompatible answers. Semantic entropy detects exactly this failure mode: inconsistency at the meaning level.
- The signal is self-referential, which makes it practically valuable: the model's own output distribution supplies the uncertainty estimate, so no external ground truth is required. A confabulating model typically answers inconsistently across samples, producing semantically incompatible answers on different runs. That inconsistency is invisible at the token level but directly measurable at the semantic level, as the usage sketch below shows.
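
Using the sketch above, a confabulation shows up directly as high semantic entropy across samples. The question and answers below are invented for illustration:

```python
q = "When was the author born?"
confab = ["He was born in 1952.", "He was born in 1948.", "He was born in 1961."]
stable = ["He was born in 1952.", "1952.", "The author was born in 1952."]

print(discrete_semantic_entropy(q, confab))  # high: three incompatible meaning clusters
print(discrete_semantic_entropy(q, stable))  # ~0: one cluster despite different wordings
```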




Semantic entropy detects confabulations by computing uncertainty over meanings rather than tokens.