Can uncertainty estimates based on model self-assessment reliably signal errors?

This explores whether a model's own confidence — its internal sense of 'how sure am I?' — can be trusted as an error detector, and the corpus reveals a sharp split between self-assessment as judgment (unreliable) and confidence as a raw signal (surprisingly useful).

Whether a model's own confidence can reliably flag when it is wrong depends entirely on *how* that confidence is extracted. When a model is asked to judge its own answers, it fails: models carry a built-in bias toward trusting outputs they generated themselves, because a high-probability answer simply *feels* correct during self-evaluation (Why do models trust their own generated answers?). Reflection does not fix this: across eight models, self-reflection turns out to be mostly confirmatory theater that rarely changes the initial answer, and the reasoning traces do not faithfully describe what the model actually did (Can we actually trust reasoning model outputs?). So the intuitive version of self-assessment, 'model, check your work', is close to a closed loop that rubber-stamps itself.

But a second, more mechanical kind of self-assessment holds up far better: the model's intrinsic token probabilities, read directly rather than asked about. Methods like RLPR and INTUITOR use the model's own token probabilities and confidence as a reward signal, replacing external verifiers in domains where no answer key exists (Can model confidence alone replace external answer verification?). RLSF goes further, using answer-span confidence to rank reasoning traces and actually *restore* calibration that RLHF had eroded (Can model confidence work as a reward signal for reasoning?). The twist is that the same underlying confidence is reliable when read silently from the token probabilities and unreliable when the model is asked to narrate it.
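To make "answer-span confidence" concrete, here is a minimal sketch of reading confidence from token log-probabilities and ranking candidate traces by it. It is an illustration under stated assumptions, not the RLSF, RLPR, or INTUITOR implementation: the function names, the trace dictionary layout, the log-probability values, and the choice of a length-normalized (geometric-mean) score are all hypothetical.

```python
import math
from typing import List, Sequence


def span_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the answer tokens.

    Averaging log-probabilities before exponentiating normalizes for
    span length, so long answers are not penalized just for being long.
    (Assumed scoring rule; the source does not specify one.)
    """
    if not token_logprobs:
        raise ValueError("answer span has no tokens")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def rank_traces(traces: List[dict]) -> List[dict]:
    """Sort candidate traces from most to least confident answer span."""
    return sorted(
        traces,
        key=lambda t: span_confidence(t["answer_logprobs"]),
        reverse=True,
    )


# Made-up log-probabilities for the answer tokens of three sampled traces.
traces = [
    {"id": "a", "answer_logprobs": [-0.05, -0.10]},
    {"id": "b", "answer_logprobs": [-1.20, -0.90]},
    {"id": "c", "answer_logprobs": [-0.30, -0.40]},
]

ranked = rank_traces(traces)
ranked_ids = [t["id"] for t in ranked]
print(ranked_ids)
```

The point of the sketch is that nothing here asks the model a question: the score is computed from numbers the model already produced while generating, which is the "read directly rather than asked about" distinction.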

The catch is that confidence is only as honest as the training that shaped it. Binary correctness rewards, a staple of RL training, actively degrade calibration: rewarding only right-or-wrong gives models every incentive to guess confidently and imposes no penalty for confident wrong answers. Adding the Brier score, a proper scoring rule, as a second reward term mathematically restores the link between confidence and accuracy (Does binary reward training hurt model calibration?). So whether self-assessed uncertainty signals errors is not a fixed property of models; it is something training can break or repair. When calibration *is* intact, confidence even predicts behavior you'd want from a reliable system: high-confidence models resist prompt rephrasing, while low-confidence ones swing wildly on cosmetic changes (Does model confidence predict robustness to prompt changes?).
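A small numeric sketch can show why a Brier term repairs the incentive. It assumes one plausible form of the combined reward, correctness minus the squared gap between stated confidence and correctness; the note only says the Brier score is added as a second reward term, so the exact formula, the function names, and the 0.7 success probability below are assumptions for illustration.

```python
def binary_reward(correct: bool) -> float:
    """Right-or-wrong reward: the stated confidence never enters."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty (assumed form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected Brier-augmented reward when the answer is right with
    probability p_correct and the model states `confidence`."""
    return (
        p_correct * brier_augmented_reward(True, confidence)
        + (1.0 - p_correct) * brier_augmented_reward(False, confidence)
    )


# Search stated confidences 0.00, 0.01, ..., 1.00 for the best payoff.
grid = [i / 100 for i in range(101)]
p_correct = 0.7  # hypothetical true chance the answer is right
best_confidence = max(grid, key=lambda q: expected_reward(p_correct, q))
print(best_confidence)
```

Under the binary reward the payoff is the same whatever confidence is stated, so nothing discourages overclaiming. With the Brier term, the expected reward peaks when stated confidence equals the true chance of being right, which is the defining property of a proper scoring rule.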

The collection also warns against a tempting shortcut: mistaking *consistency* for *reliability*. Setting temperature to zero produces the same output every time, but that output is still a single draw from the model's distribution; repeated 100 times, it stays identical without becoming any more trustworthy (Does setting temperature to zero actually make LLM outputs reliable?). Stable ≠ correct, and a model that confidently repeats itself is not the same as a model that's right.
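The gap between consistency and reliability can be seen in a toy sketch. It simplifies temperature-zero decoding to "always pick the single most likely answer" (real greedy decoding works token by token), and the answer distribution is invented for illustration.

```python
from collections import Counter

# Hypothetical answer distribution for one prompt (made-up numbers).
answer_probs = {"A": 0.40, "B": 0.35, "C": 0.25}


def greedy_decode(probs: dict) -> str:
    """Temperature-zero stand-in: always return the most likely answer."""
    return max(probs, key=probs.get)


# Repeat the deterministic decode 100 times.
runs = [greedy_decode(answer_probs) for _ in range(100)]
counts = Counter(runs)

# Fraction of runs that agree with the most common output.
agreement = counts.most_common(1)[0][1] / len(runs)

# Probability mass the model actually placed on that repeated answer.
mass_on_answer = answer_probs[runs[0]]

print(agreement, mass_on_answer)
```

Agreement across runs is perfect, yet the model put less than half of its probability mass on the answer it keeps repeating. Repetition measures the decoding procedure, not the model's certainty.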

Where this lands: self-assessed uncertainty *can* signal errors, but the failure modes cluster around self-reference. A model evaluating its own answer is the weakest case; its raw probability signal, properly calibrated, is the strongest. The danger compounds when errors feed back on themselves: models degrade sharply once their own mistakes contaminate their context (Do models fail worse when their own errors fill the context?), and pure self-improvement loops stall precisely because a model cannot verify its way past its own blind spots without an external anchor (Can models reliably improve themselves without external feedback?). The useful takeaway: trust the confidence signal a model leaks, not the confidence it reports, and never trust either when the model is grading its own homework alone.

Sources (9 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
