Do models actually self-assess their confidence or just confirm answers?

This explores whether LLMs genuinely evaluate how likely they are to be right, or whether 'confidence' is mostly a model rubber-stamping its own output. The corpus splits into two camps that, read together, suggest the honest answer is *both* — models do carry a usable internal signal, but they also have a strong built-in pull toward agreeing with themselves, and the two are easy to confuse.

Start with the skeptical evidence. Models carry a structural bias toward trusting answers they generated, because a high-probability output simply *feels* more correct when the same model re-reads it (see "Why do models trust their own generated answers?"). That self-agreement loop turns toxic when a model revises its own work: instead of catching errors, single-model self-revision tends to make the model *more* confident in wrong answers, a failure mode that only reverses when you bring in genuinely different models to argue (see "Does a model improve by arguing with itself?"). And much of what looks like a model 'reporting its confidence' is really an echo of training-data patterns rather than any inspection of its own internal state (see "Can language models actually introspect about their own states?"). So a lot of apparent self-assessment is confirmation wearing a confidence costume.

But the optimistic camp shows there's a real signal underneath, if you read the right thing. The model's *intrinsic token probability* of a correct answer is informative enough to replace external verifiers as a reward during training (see "Can model confidence alone replace external answer verification?"), and using answer-span confidence to rank reasoning traces actually *restores* calibration that standard RLHF degrades (see "Can model confidence work as a reward signal for reasoning?"). Confidence even predicts behavior: highly confident models resist prompt rephrasing, while low-confidence ones swing wildly with wording (see "Does model confidence predict robustness to prompt changes?"). The key distinction the corpus keeps drawing is *comparison vs. self-agreement*: confidence becomes meaningful when an answer is weighed against alternatives or judged pairwise (see "Can models learn to judge themselves without external rewards?"), and models can even be trained to internalize this self-evaluation in unused sequence space (see "Can models learn to evaluate their own work during training?"). Same model, different framing: ask 'is this answer good?' and you get confirmation; ask 'is this answer better than that one?' and you get assessment.
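To make "answer-span confidence" concrete, here is a minimal sketch that scores each reasoning trace by the length-normalized probability of its answer tokens and ranks traces by that score. The geometric-mean aggregation, the function names, and the log-probability values are all assumptions made for illustration; the cited notes do not specify how RLSF or RLPR aggregate token probabilities.

```python
import math


def span_confidence(token_logprobs):
    """Geometric-mean probability of the answer tokens (an assumed aggregation)."""
    if not token_logprobs:
        raise ValueError("answer span has no tokens")
    # Averaging log-probs before exponentiating normalizes for answer length.
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def rank_traces(traces):
    """Sort reasoning traces from most to least confident answer span."""
    return sorted(
        traces,
        key=lambda t: span_confidence(t["answer_logprobs"]),
        reverse=True,
    )


# Hypothetical per-token log-probabilities, invented for this example.
traces = [
    {"id": "trace_a", "answer_logprobs": [-0.05, -0.10]},
    {"id": "trace_b", "answer_logprobs": [-1.20, -0.90]},
    {"id": "trace_c", "answer_logprobs": [-0.40, -0.30]},
]

ranked = rank_traces(traces)
for t in ranked:
    print(t["id"], round(span_confidence(t["answer_logprobs"]), 3))
```

Because the traces are ordered against each other rather than judged one at a time, this is the comparative framing the paragraph describes, not a yes/no self-check.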

There's also a hard ceiling worth knowing about. Self-assessment can only help where a model verifies better than it generates: the 'generation-verification gap.' That gap is real for reasoning but collapses for factual recall, which neatly predicts *where* self-confidence is trustworthy and where it's just self-confirmation (see "What limits how much models can improve themselves?").
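One simple way to picture the generation-verification gap is as a difference of two accuracies on the same task: how often the model's verdict on a candidate answer is right, minus how often its own generated answer is right. The sketch below uses that definition as an assumption (the source note does not give a formula), and the per-item outcomes are invented to mirror the reasoning-versus-recall contrast.

```python
def accuracy(flags):
    """Fraction of True entries in a list of per-item outcomes."""
    if not flags:
        raise ValueError("need at least one outcome")
    return sum(flags) / len(flags)


def generation_verification_gap(generated_correct, verified_correct):
    """Verification accuracy minus generation accuracy (assumed definition)."""
    return accuracy(verified_correct) - accuracy(generated_correct)


# Hypothetical outcomes, invented for illustration only.
# generated_correct[i]: was the model's own answer to item i right?
# verified_correct[i]: was the model's verdict on a candidate for item i right?
reasoning_gap = generation_verification_gap(
    generated_correct=[True, False, False, True, False],
    verified_correct=[True, True, False, True, True],
)
factual_gap = generation_verification_gap(
    generated_correct=[True, False, True, True, False],
    verified_correct=[True, False, True, False, True],
)

print("reasoning gap:", round(reasoning_gap, 2))
print("factual gap:", round(factual_gap, 2))
```

A positive gap means the model's own judgments carry information its generations lack; a gap near zero means self-assessment has nothing extra to offer.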

The twist you didn't come asking for: the riskiest part of this isn't the model, it's you. Users across every language tested track a model's *expressed* confidence rather than its actual accuracy, faithfully following overconfident wrong answers (see "Do users worldwide trust confident AI outputs even when wrong?"). So even where a model's internal confidence is poorly calibrated, its outward confidence performance lands as if it were earned, which is exactly the confirmation-masquerading-as-assessment problem, now playing out in the human reading the screen.
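The distance between expressed confidence and actual accuracy can be measured with a standard calibration metric. The sketch below computes expected calibration error (ECE) with equal-width bins; choosing ECE, the bin count, and the sample values are assumptions for illustration and are not taken from the cited study.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted mean of |average confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - acc)
    return ece


# Hypothetical data: the model states 90% confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, True])
# Hypothetical data: full stated confidence, always right.
well_calibrated = expected_calibration_error([1.0, 1.0], [True, True])

print("overconfident ECE:", round(overconfident, 2))
print("well-calibrated ECE:", round(well_calibrated, 2))
```

A reader who follows stated confidence inherits the first kind of error in full, since nothing on the screen distinguishes the two cases.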

Sources (10 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

Do users worldwide trust confident AI outputs even when wrong?

Cross-linguistic research shows users in every language trust confident AI outputs even when inaccurate. While confidence expression varies by language, users everywhere track confidence signals rather than accuracy, making overconfident errors systematically followed.
