Can semantic entropy improve model calibration without external ground truth?

This note explores whether a model's own uncertainty signals, such as the entropy over what it generates, can be turned into a self-supervised calibration mechanism, so that confidence tracks correctness without anyone feeding in labeled right answers.

This reads the question as: can a model learn to be well-calibrated (confident when right, unsure when wrong) using only its own internal uncertainty, with no external ground truth or human labels? The corpus has no note on 'semantic entropy' by that name, but it does have a coherent body of work on the same territory, namely using a model's own confidence as the supervisory signal, and the answer it points to is yes, with caveats.
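Since none of the notes define the term, here is a minimal sketch of what 'semantic entropy' is usually taken to mean: sample several answers to the same prompt, group the ones that mean the same thing, and compute entropy over the groups rather than over raw strings. The `same_meaning` callback is a hypothetical placeholder; published versions of the idea typically use an entailment model for that step, and the string-normalizing `normalized_match` below is only a stand-in so the example runs.

```python
import math
from typing import Callable, List, Sequence


def semantic_entropy(
    answers: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> float:
    """Entropy (in nats) over meaning clusters of sampled answers."""
    if not answers:
        return 0.0

    # Greedy clustering: each answer joins the first cluster whose
    # representative it matches, otherwise it starts a new cluster.
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])

    total = len(answers)
    shares = [len(cluster) / total for cluster in clusters]
    return sum(p * math.log(1.0 / p) for p in shares)


def normalized_match(a: str, b: str) -> bool:
    """Stand-in equivalence check (assumption): compare normalized strings."""
    def norm(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return norm(a) == norm(b)
```

Answers that differ only in surface form collapse into one cluster and yield zero entropy, while answers that disagree in meaning spread across clusters and raise it. No labeled answer is involved at any point, which is what makes the quantity a candidate for self-supervised calibration.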

The most direct support is RLSF, which uses the model's confidence over its own answer span to rank reasoning traces and build synthetic preferences. This strengthens step-by-step reasoning while reversing the calibration damage that standard RLHF causes, all without human labels or external verifiers (see "Can model confidence work as a reward signal for reasoning?"). That matters because it fixes a known failure mode: binary correctness rewards quietly destroy calibration, since a reward that only checks right/wrong never penalizes a confident wrong answer more than a hesitant one, so the model learns to guess loudly. Adding a proper scoring rule (the Brier score) as a second reward term mathematically forces accuracy and calibration to improve together (see "Does binary reward training hurt model calibration?"). The Brier term still presupposes a correctness check, so it is RLSF that removes the labels, while the Brier result shows that calibration can be made an explicit training objective. Together these say calibration isn't something you can only buy with labeled truth: it can be engineered from the model's own probability signal.
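The effect of the second reward term can be seen in a minimal sketch. It assumes the reward is the binary correctness term minus the Brier penalty with equal weight; the notes do not state the exact weighting, and the function name `calibration_aware_reward` is hypothetical. The `correct` flag stands for whatever correctness check the binary reward already relies on.

```python
def calibration_aware_reward(correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty (equal weights assumed).

    confidence is the model's stated probability that its answer is right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")

    outcome = 1.0 if correct else 0.0
    # Brier penalty: squared gap between stated confidence and the outcome.
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty
```

Under the binary term alone, a wrong answer scores 0 whether the model was sure or unsure. With the penalty, a wrong answer at confidence 1.0 scores -1.0 while a wrong answer at confidence 0.0 scores 0.0, so guessing loudly stops being free.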

The deeper question is whether that internal signal is actually trustworthy. Two notes suggest it carries real information. Models that are highly confident genuinely resist prompt rephrasing, while low confidence predicts wild output swings, so confidence and robustness move together (see "Does model confidence predict robustness to prompt changes?"). And in retrieval, calibrated token-probability uncertainty beats far more expensive multi-call adaptive retrieval on single-hop tasks, and matches it on multi-hop tasks, at deciding when the model needs to look something up; the model's self-knowledge turns out more reliable than external heuristics (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). There's also a structural clue about where this signal lives: only about 20% of tokens are high-entropy 'forking points,' and those minority tokens carry the learning signal in reasoning training (see "Do high-entropy tokens drive reasoning model improvements?"). That implies entropy isn't uniform noise but is concentrated at the decisions that matter, which is exactly what a calibration signal would want to latch onto.
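To make the 'forking points' idea concrete, the sketch below computes Shannon entropy for each step's next-token distribution and keeps the highest-entropy fifth of the positions. It is an illustration only: the function names are hypothetical, the 0.2 default simply mirrors the roughly 20% figure above, and real systems would read these distributions from model logits rather than hand-written lists.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def forking_token_indices(
    step_distributions: Sequence[Sequence[float]],
    top_fraction: float = 0.2,
) -> List[int]:
    """Positions of the highest-entropy steps, returned in sequence order."""
    entropies = [token_entropy(dist) for dist in step_distributions]
    if not entropies:
        return []

    # Keep at least one position so short sequences still return something.
    keep = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(
        range(len(entropies)), key=lambda i: entropies[i], reverse=True
    )
    return sorted(ranked[:keep])
```

Given five steps where only one distribution is close to uniform, the function returns that single position, which is the kind of concentrated signal a calibration scheme could attend to.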

The caveats are worth knowing because they're easy to miss. Consistency is not the same as reliability: pinning temperature to zero just replays one draw from the distribution repeatedly. It looks stable, but it's still a single sample, not a calibrated estimate (see "Does setting temperature to zero actually make LLM outputs reliable?"). And confidence can be confidently wrong for social rather than epistemic reasons: models will decline to correct a user's false claim even when they demonstrably know the right answer, a face-saving habit learned from human conversation (see "Why do language models avoid correcting false user claims?"). So an entropy-based self-calibration scheme inherits whatever biases shaped the model's confidence in the first place.
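The first caveat can be illustrated with a toy example. It assumes zero-temperature decoding always returns the single most likely answer, and the answer distribution is invented for illustration; the helper names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Mapping, Sequence


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def empirical_entropy(outputs: Sequence[str]) -> float:
    """Entropy of the observed output frequencies."""
    counts = Counter(outputs)
    total = len(outputs)
    return entropy(count / total for count in counts.values())


def greedy_decode(answer_probs: Mapping[str, float]) -> str:
    """Zero-temperature stand-in (assumption): always pick the top answer."""
    return max(answer_probs, key=lambda answer: answer_probs[answer])


# Invented distribution: the model is far from sure which answer is right.
answer_probs = {"A": 0.40, "B": 0.35, "C": 0.25}

# One hundred zero-temperature replays all return the same string.
replays = [greedy_decode(answer_probs) for _ in range(100)]

observed = empirical_entropy(replays)        # zero: the replays never vary
underlying = entropy(answer_probs.values())  # well above zero
```

The replays show no variation even though the underlying distribution is close to a three-way toss-up, so agreement across zero-temperature runs says nothing about how uncertain the model actually is.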

The less obvious takeaway: the most promising path here isn't measuring uncertainty after the fact but making the model's own confidence the training reward, closing the loop so the model optimizes its calibration against itself. That sidesteps ground truth entirely, but it also means the model is grading its own homework, which works only as far as the underlying confidence signal is honest.

Sources 7 notes

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
