Can measuring semantic entropy help us detect unreliable generations?
This explores whether semantic entropy — measuring how much a model's sampled answers disagree in meaning — can flag generations you shouldn't trust, and where that signal sits among the corpus's other reliability checks.
The short version: the corpus says yes, with an important caveat about what kind of unreliability it catches. The core idea behind semantic entropy is that you sample a model's answer several times, cluster those answers by what they actually *mean* (using bidirectional entailment, not surface wording), and measure how spread out the meanings are. High spread signals that the model is likely confabulating, that is, making something up, even when each individual answer looks fluent and confident (Can we detect when language models confabulate?). The clever part is that this works without any task-specific training; the model's own disagreement with itself becomes the signal.
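The sample, cluster, measure loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method from the cited note: `TOY_MEANINGS` and `toy_entails` are hypothetical stand-ins for a real entailment model, and each cluster's probability is estimated simply by counting how many samples landed in it.

```python
import math
from typing import Callable, List

# Hypothetical stand-in for an entailment (NLI) model: each sampled answer is
# hand-labelled with the meaning it expresses. A real system would query a model.
TOY_MEANINGS = {
    "Paris": "paris",
    "It's Paris.": "paris",
    "The capital is Paris.": "paris",
    "Lyon": "lyon",
    "I believe it's Marseille.": "marseille",
}


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy entailment check: true when both answers carry the same label."""
    return TOY_MEANINGS[premise] == TOY_MEANINGS[hypothesis]


def cluster_by_meaning(
    answers: List[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional entailment: same meaning, whatever the wording.
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: List[str], entails: Callable[[str, str], bool]
) -> float:
    """Shannon entropy (in nats) over meaning clusters, using sample counts."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    clusters = cluster_by_meaning(answers, entails)
    total = len(answers)
    return sum((len(c) / total) * math.log(total / len(c)) for c in clusters)


# Three phrasings of one meaning versus three different meanings.
agreeing = ["Paris", "It's Paris.", "The capital is Paris."]
scattered = ["Paris", "Lyon", "I believe it's Marseille."]

print(semantic_entropy(agreeing, toy_entails))
print(semantic_entropy(scattered, toy_entails))
```

The three differently worded "Paris" answers fall into one cluster, so their entropy is zero even though the strings differ; the scattered set splits into three clusters and scores the maximum for three samples. Any threshold for "too high" would have to be chosen per application; none is given in the source.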
What makes this matter is that the obvious alternative doesn't work. You might think a model knows when it's wrong, but the corpus shows the opposite: models are structurally biased toward trusting their own answers, because a high-probability generation simply *feels* correct during self-evaluation (Why do models trust their own generated answers?). So asking a model 'are you sure?' is circular. Semantic entropy sidesteps that by comparing the model against its own alternative outputs rather than asking it to introspect. That is the same move the corpus credits with breaking the self-agreement loop: comparing an answer against broader alternatives is what actually exposes overconfidence.
The sharpest adjacent lesson in the corpus is about what reliability *isn't*. Setting temperature to zero feels like it should make outputs reliable, but it only makes them *consistent*: you get the same single draw from the probability distribution every time, and that draw can be reliably wrong (Does setting temperature to zero actually make LLM outputs reliable?). This is the mirror image of semantic entropy: determinism collapses the very sampling spread that entropy needs to measure. To detect unreliability you have to look at the *distribution* of meanings, not pin it to one point.
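A toy sampler makes the collapse concrete. The sketch below assumes a made-up answer distribution (`TOY_LOGITS`, hypothetical) in which the model is nearly undecided between three answers, and it treats each distinct string as its own meaning for simplicity. At temperature zero every draw is the top-scoring answer, so the measured entropy is zero no matter how uncertain the underlying distribution is.

```python
import math
import random
from collections import Counter
from typing import Dict, List

# Hypothetical logits for three candidate answers the model barely separates.
TOY_LOGITS: Dict[str, float] = {"1912": 1.0, "1913": 0.9, "1911": 0.8}


def sample_answer(
    logits: Dict[str, float], temperature: float, rng: random.Random
) -> str:
    """Draw one answer; temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {answer: logit / temperature for answer, logit in logits.items()}
    top = max(scaled.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(value - top) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


def entropy_of(samples: List[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return sum((c / total) * math.log(total / c) for c in counts.values())


rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = [sample_answer(TOY_LOGITS, 0, rng) for _ in range(50)]
sampled = [sample_answer(TOY_LOGITS, 1.0, rng) for _ in range(50)]

print(entropy_of(greedy))   # every draw is identical, so there is no spread
print(entropy_of(sampled))  # the disagreement becomes visible
```

The greedy run is perfectly consistent and tells you nothing about whether "1912" is right; only the sampled run exposes how close the alternatives are.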
There's a subtlety worth knowing: 'entropy' shows up in this corpus wearing two different hats. Semantic entropy (high = bad, signals confabulation) is a detection tool. But token-level entropy is also where reasoning models do their real work: only about 20% of tokens are high-entropy 'forking points,' and those are precisely what reinforcement learning (RLVR) tunes (Do high-entropy tokens drive reasoning model improvements?). Relatedly, post-trained models produce 3–4x lower entropy on their own generated text, driven by an internal representation of input surprise (Why do models produce less uncertain outputs on their own text?). So entropy isn't simply 'uncertainty = bad'; it's a structural feature whose meaning depends on where you measure it.
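To make the second hat concrete, here is a minimal sketch of token-level entropy. It assumes you already have the model's next-token probability distribution at each position (the `toy_distributions` below are invented), and it picks "forking" positions by taking the top 20% by entropy. That ranking rule is a simplification for illustration; the source gives the roughly 20% figure but not the selection procedure.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


def forking_positions(
    distributions: List[List[float]], fraction: float = 0.2
) -> List[int]:
    """Return the positions whose entropy ranks in the top `fraction`."""
    entropies = [token_entropy(d) for d in distributions]
    keep = max(1, round(len(entropies) * fraction))
    ranked = sorted(
        range(len(entropies)), key=lambda i: entropies[i], reverse=True
    )
    return sorted(ranked[:keep])


# Invented distributions for five token positions. Only position 1 is a
# genuine toss-up; the rest are near-certain continuations.
toy_distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.01, 0.00, 0.00],
    [0.80, 0.10, 0.05, 0.05],
]

print(forking_positions(toy_distributions))
```

Note the contrast with semantic entropy: this quantity is measured per token inside a single generation, whereas semantic entropy is measured over meanings across several complete generations.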
Finally, semantic entropy is one detector among several the corpus offers, and notably it catches a different failure than some of the others. It won't catch a model that 'knows' the truth but agrees with a false premise to be polite; that is social accommodation learned from RLHF, not hallucination, and it needs a different fix entirely (Why do language models agree with false claims they know are wrong?). Nor will it stop the silent, compounding document corruption that frontier models produce across long delegated workflows, where errors accumulate without ever plateauing (Do frontier LLMs silently corrupt documents in long workflows?). The reliability story, in other words, is that semantic entropy is a strong meaning-level confabulation detector, and that 'unreliable generation' is several distinct problems wearing one name.
Sources (7 notes)
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.