How does model confidence relate to accuracy in underfitted domains?

This explores when a model's confidence stops tracking its accuracy — specifically in the thin, undertrained corners of a domain where the model has seen too little to know what it doesn't know.

The short version the corpus keeps circling: confidence is a decent accuracy signal where the model is well-fit, and a dangerously misleading one where it isn't. The interesting part is *why* the relationship inverts, and what catches the failure when confidence won't.

In well-trodden territory, confidence and correctness move together tightly enough to build on. Calibrated token-probability uncertainty is a better trigger for "should I retrieve more?" than expensive multi-call heuristics on single-hop tasks, and matches them on multi-hop tasks at a fraction of the calls (Can simple uncertainty estimates beat complex adaptive retrieval?). A model's own answer-span probability works well enough as a reward signal to replace external verifiers and even to repair calibration that RLHF had degraded (Can model confidence work as a reward signal for reasoning?; Can model confidence alone replace external answer verification?). Confidence also predicts robustness: highly confident models resist prompt rephrasing, while low-confidence ones swing widely with wording (Does model confidence predict robustness to prompt changes?). So in-distribution, high confidence really does mean something.
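
To make the retrieval trigger concrete, here is a minimal sketch, assuming the model exposes per-token log-probabilities for its draft answer. The function names, the geometric-mean aggregation, and the 0.6 threshold are all illustrative assumptions, not the method from the cited note; in particular, the calibration step the note relies on is not shown.

```python
import math


def mean_token_confidence(token_logprobs):
    """Geometric-mean token probability of a generated answer.

    Hypothetical helper: the aggregation choice is an assumption.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs, threshold=0.6):
    """Trigger retrieval only when confidence falls below the threshold.

    The threshold is a placeholder; in practice it would be tuned on
    held-out data after calibrating the probabilities.
    """
    return mean_token_confidence(token_logprobs) < threshold


# Made-up log-probabilities for two draft answers.
confident_answer = [math.log(0.95), math.log(0.90), math.log(0.92)]
hesitant_answer = [math.log(0.70), math.log(0.30), math.log(0.40)]

retrieve_for_confident = should_retrieve(confident_answer)
retrieve_for_hesitant = should_retrieve(hesitant_answer)
```

The appeal of this shape is cost: one forward pass already yields the log-probabilities, so the decision adds no extra model or retriever calls.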

The relationship breaks precisely where the data runs thin. The sharpest finding is that confident wrong answers don't look wrong: fluent, assured errors in medical triage, legal interpretation, and financial planning concentrate in exactly the rare cases where harm happens, and aggregate accuracy scores hide them because overall numbers stay high (Why do confident wrong answers hide in standard accuracy metrics?). On those inputs, novel combinations the model never saw enough of, it can be highly confident and wrong at the same time. That's the underfitting signature.
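
The masking effect is easy to see with a small slice-level report. This is a sketch on fabricated toy data, assuming each evaluation record carries a slice label, a confidence score, and a correctness flag; the slice names and numbers are invented purely to show the arithmetic.

```python
from collections import defaultdict


def accuracy_and_confidence_by_slice(records):
    """Summarise accuracy and mean confidence per slice.

    records: iterable of (slice_name, confidence, is_correct) tuples.
    "gap" is mean confidence minus accuracy; a large positive gap
    marks a slice where the model is overconfident.
    """
    buckets = defaultdict(list)
    for slice_name, confidence, is_correct in records:
        buckets[slice_name].append((confidence, is_correct))

    report = {}
    for name, rows in buckets.items():
        n = len(rows)
        mean_conf = sum(conf for conf, _ in rows) / n
        accuracy = sum(1 for _, ok in rows if ok) / n
        report[name] = {
            "n": n,
            "accuracy": accuracy,
            "mean_confidence": mean_conf,
            "gap": mean_conf - accuracy,
        }
    return report


# Fabricated toy data: 95 common cases, all correct; 5 rare cases, 1 correct.
records = (
    [("common", 0.9, True)] * 95
    + [("rare", 0.9, False)] * 4
    + [("rare", 0.9, True)] * 1
)

overall_accuracy = sum(1 for _, _, ok in records if ok) / len(records)
report = accuracy_and_confidence_by_slice(records)
```

In this toy set the overall accuracy is 96%, while the rare slice sits at 20% accuracy with the same 0.9 confidence as everything else. Nothing in the headline number or in the confidence scores distinguishes the slice where the errors live.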

Which raises a quietly subversive point: if confidence fails in the rare cases, don't ask the model how sure it is; ask the *training data* how often it saw this. Entity co-occurrence statistics from pretraining flag hallucination risk even when the model reports high confidence, because they catch the root cause (unseen combinations) rather than the symptom (low confidence) (Can pretraining data statistics detect hallucinations better than model confidence?). Confidence is a downstream readout that goes blind on rare inputs; the data-side count doesn't. There is a similar move at the trace level: global confidence averaging masks local breakdowns, while step-level confidence catches where reasoning actually fails (Does step-level confidence outperform global averaging for trace filtering?). A related trap appears in training, where overly hard samples in a domain the model can't fit produce confident degenerate shortcuts that then contaminate working capabilities (Do overly hard RLVR samples actually harm model capabilities?).
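
Both checks can be sketched in a few lines. What follows is an illustration under stated assumptions, not the implementation from either cited note: `pair_counts` stands in for co-occurrence counts that would come from a pretraining-corpus index, the entity names are placeholders, and `min_count` and `window` are arbitrary values.

```python
from itertools import combinations


def rare_entity_pairs(entities, pair_counts, min_count=5):
    """Return entity pairs that co-occur fewer than min_count times.

    pair_counts maps alphabetically sorted (a, b) tuples to counts;
    pairs missing from the map are treated as never seen together.
    """
    rare = []
    for a, b in combinations(sorted(set(entities)), 2):
        if pair_counts.get((a, b), 0) < min_count:
            rare.append((a, b))
    return rare


def weakest_window(step_confidences, window=2):
    """Lowest mean confidence over any run of `window` consecutive steps."""
    if window < 1 or len(step_confidences) < window:
        raise ValueError("need at least `window` steps")
    return min(
        sum(step_confidences[i:i + window]) / window
        for i in range(len(step_confidences) - window + 1)
    )


# Data-side check on made-up counts: only one pair is well attested.
pair_counts = {("entity_a", "entity_b"): 120}
flagged = rare_entity_pairs(["entity_a", "entity_b", "entity_c"], pair_counts)

# Trace-level check on made-up per-step confidences.
trace = [0.90, 0.95, 0.30, 0.35, 0.90, 0.95]
global_mean = sum(trace) / len(trace)
local_min = weakest_window(trace, window=2)
```

On the toy trace the global mean is about 0.73, which would pass most filters, while the weakest two-step window is about 0.33. The co-occurrence check flags both pairs involving `entity_c` without consulting the model's confidence at all.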

The part you didn't know you wanted: this isn't only a model problem. Users in every language tested track the model's expressed confidence rather than its actual accuracy, so overconfident errors get followed systematically (Do users worldwide trust confident AI outputs even when wrong?). Underfitting produces confident errors; human trust then amplifies exactly those errors. The decoupling of confidence from accuracy in thin domains isn't a calibration curiosity; it's the precise seam where unreliable outputs slip past both the metric and the user.

Sources 9 notes

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do confident wrong answers hide in standard accuracy metrics?

Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
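
A worked example of why a rare success looms so large. This sketch assumes the common mean-and-standard-deviation form of group-relative normalization with binary rewards; the note does not state the exact formula, so the function below, the use of population standard deviation, and the `eps` term are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own sampling group.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible problem: 1 accidental success in 8 samples.
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]
# Well-matched problem: 4 successes in 8 samples.
balanced_group = [0, 0, 0, 0, 1, 1, 1, 1]

hard_success_advantage = group_relative_advantages(hard_group)[-1]
balanced_success_advantage = group_relative_advantages(balanced_group)[-1]
```

Under these assumptions the lone success in the hard group receives an advantage of roughly 2.65, against roughly 1.0 for a success in the balanced group. Whatever the lucky trajectory happened to do, sound or not, gets the strongest push.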

Do users worldwide trust confident AI outputs even when wrong?

Cross-linguistic research shows users in every language trust confident AI outputs even when inaccurate. While confidence expression varies by language, users everywhere track confidence signals rather than accuracy, making overconfident errors systematically followed.
