Can an LLM be well calibrated but still unreliable on single evaluations?
This explores the gap between calibration in the aggregate — a model's confidence scores matching real-world frequencies over many samples — and reliability on the single output you actually receive, which the corpus treats as two genuinely different things.
Put plainly: even if a model's confidence is well tuned on average, can the one answer in front of you still be untrustworthy? The corpus says yes, and the sharpest framing comes from the distinction between a fixed output and a reliable one. Setting temperature to zero or pinning a seed makes a model repeat the *same* answer every time, but that answer is still a single draw from a probability distribution, so consistency is not reliability (see "Does setting temperature to zero actually make LLM outputs reliable?"). The McDonald's omega testing described there, which runs 100 repetitions to measure reliability, only makes sense *because* a single evaluation cannot tell you whether the model would land somewhere else on a different draw. Calibration lives in the distribution; the answer you got lives at one point in it.
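A minimal sketch of why a pinned seed looks stable without being reliable. It treats a whole answer as one draw from a made-up categorical distribution (`ANSWER_PROBS` and both helper functions are hypothetical), and it measures simple agreement with the most common answer, which is a much cruder statistic than the McDonald's omega used in the cited note.

```python
import random
from collections import Counter

# Hypothetical answer distribution for a single prompt (illustrative numbers only).
ANSWER_PROBS = {"A": 0.55, "B": 0.35, "C": 0.10}

def draw_answer(probs, seed):
    """Draw one answer from the distribution using a seeded generator."""
    rng = random.Random(seed)
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=1)[0]

def modal_agreement(samples):
    """Fraction of runs that agree with the most common answer."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)

# Pinned seed: 100 runs, every one replays the same draw.
pinned_runs = [draw_answer(ANSWER_PROBS, seed=42) for _ in range(100)]

# Varying seed: 100 runs, each an independent draw from the same distribution.
varied_runs = [draw_answer(ANSWER_PROBS, seed=s) for s in range(100)]

print("pinned-seed agreement:", modal_agreement(pinned_runs))
print("varied-seed agreement:", modal_agreement(varied_runs))
```

The pinned runs agree perfectly because they are the same draw repeated. Only the varied runs say anything about how often the model would have answered differently.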
The failure gets worse where it is hardest to notice. In specialized domains, models pair low accuracy with high confidence, and prompting techniques that improve general performance do not reduce the overconfidence (see "Why do language models fail confidently in specialized domains?"). So a model could look acceptably calibrated on a broad benchmark while being confidently wrong on exactly the clinical or niche query you care about: average calibration masks single-case unreliability. The benchmarks themselves conspire here. Standard NLP evaluations quietly filter out the ambiguous examples where annotators disagree, and on those held-back cases accuracy collapses from 90% to 32% (see "Do standard NLP benchmarks hide LLM ambiguity failures?"). The aggregate number stays clean precisely because the cases that would expose single-shot failure were removed.
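To make the masking effect concrete, here is a small sketch with synthetic records (the counts, the 0.9 confidence level, and the function name `calibration_gap` are all assumptions for illustration, not figures from the cited notes). It compares the gap between stated confidence and accuracy on the pooled data with the same gap on each slice.

```python
def calibration_gap(records):
    """Absolute gap between mean confidence and accuracy.

    Each record is (stated_confidence, correct) with correct in {0, 1}.
    Every record here shares one confidence level, so this is equivalent
    to a one-bin expected calibration error.
    """
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(correct for _, correct in records) / len(records)
    return abs(mean_conf - accuracy)

# Synthetic data: the model states 0.9 confidence everywhere.
general = [(0.9, 1)] * 810 + [(0.9, 0)] * 90      # 900 items, 90% correct
specialized = [(0.9, 1)] * 30 + [(0.9, 0)] * 70   # 100 items, 30% correct

print("pooled gap:     ", round(calibration_gap(general + specialized), 3))
print("general gap:    ", round(calibration_gap(general), 3))
print("specialized gap:", round(calibration_gap(specialized), 3))
```

Because the specialized slice is only a tenth of the pooled data, its large gap is diluted to a small one in the aggregate, even though every specialized answer carries the same high stated confidence.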
The corpus's most useful move against this is *abstention*. LLM judges fail when asked to predict specific user preferences from sparse persona information. Adding verbal uncertainty estimation, which lets the model decline to answer rather than be forced into a judgment, recovers reliability above 80% on the high-certainty subset (see "Why do LLM judges fail at predicting sparse user preferences?"). That's the practical reframe: a model can't make every single evaluation reliable, but a well-calibrated one can at least tell you *which* single evaluations to trust. Per-case reliability becomes a routing decision, not a property of the model overall.
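The routing idea can be sketched as selective prediction: keep only the judgments whose stated confidence clears a threshold and report both how many were kept and how accurate they were. The data, the threshold, and the function name `selective_accuracy` are hypothetical; the numbers are not the ones reported in the cited note.

```python
def selective_accuracy(records, threshold):
    """Return (coverage, accuracy) over judgments the model does not abstain on.

    Each record is (stated_confidence, correct) with correct in {0, 1}.
    Judgments below the threshold are treated as abstentions.
    """
    kept = [correct for conf, correct in records if conf >= threshold]
    coverage = len(kept) / len(records)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy

# Synthetic judgments (illustrative only).
judgments = (
    [(0.9, 1)] * 36 + [(0.9, 0)] * 4      # high stated certainty: 40 items, 90% correct
    + [(0.5, 1)] * 30 + [(0.5, 0)] * 30   # low stated certainty: 60 items, 50% correct
)

forced = selective_accuracy(judgments, threshold=0.0)     # answer everything
selective = selective_accuracy(judgments, threshold=0.8)  # abstain when unsure

print("forced:    coverage=%.2f accuracy=%.2f" % forced)
print("selective: coverage=%.2f accuracy=%.2f" % selective)
```

The trade is explicit: abstaining lowers coverage, and in exchange the answers that remain are the ones worth trusting. This only works if stated confidence actually tracks correctness, which is the calibration assumption the paragraph above relies on.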
There's also a deeper reason single evaluations mislead: the unit of evaluation is wrong. Short-interaction benchmarks don't predict long-horizon performance. Models that rank similarly on single-turn tasks diverge dramatically by relay 25 of a delegated workflow (see "Do short benchmarks predict how models perform over long workflows?"). And capability that only exists across sessions, such as accumulated context, reusable procedures, and human steering, is invisible to any episode-level measurement at all (see "Should we evaluate deployed agents as whole environments instead?"). A single evaluation isn't just a noisy sample of a stable quantity; sometimes the quantity that matters doesn't exist at the single-evaluation scale.
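The cited note does not say why models diverge over long relays. One toy mechanism, assumed here purely for illustration, is that small per-relay error rates compound independently. Under that assumption (and with made-up success rates), two models that look nearly identical on a single turn separate sharply by relay 25.

```python
def intact_after(per_relay_success, relays):
    """Probability that no relay has failed, assuming independent relays.

    Independence is an illustrative assumption, not a finding from the source.
    """
    return per_relay_success ** relays

# Hypothetical per-relay success rates for two models.
model_a, model_b = 0.99, 0.97

for relays in (1, 25):
    a = intact_after(model_a, relays)
    b = intact_after(model_b, relays)
    print(f"relay {relays:>2}: A={a:.3f}  B={b:.3f}  gap={a - b:.3f}")
```

A two-point difference on a single turn would rarely change a ranking, yet under this toy model it grows into a gap of roughly thirty points by relay 25. Real degradation curves need not follow this shape.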
The less expected lesson: "well calibrated" and "reliable on this answer" can pull in opposite directions. Calibration is most easily achieved on exactly the clean, unambiguous, single-turn cases that benchmarks keep, while the unreliability hides in the ambiguous, specialized, and long-horizon cases those benchmarks throw away. The fix isn't better point estimates; it's models that know when to abstain and evaluations measured at the scale where the work actually happens.
Sources (6 notes)
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
LLMs trained on general text lack sufficient exposure to domain-specific examples, leading to low accuracy paired with high confidence in clinical NLI tasks. Prompting techniques that improved general performance fail to reduce overconfidence in specialized domains.
By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.
Sparse persona information lacks predictive power for specific preferences, causing LLM judges to fail. Verbal uncertainty estimation recovers reliability above 80% on high-certainty samples by allowing abstention rather than forced judgment.
DELEGATE-52 evaluated models across 50-round-trip relays and found short-interaction performance does not predict sustained delegation accuracy. Models ranking similarly on single-turn tasks diverged dramatically by relay 25, revealing degradation curves invisible to standard benchmarks.
A single-investigator case study with 75,671 telemetry records shows that capacity gains come from accumulated context and reusable procedures that only exist across sessions with human direction. Model and episode-level evaluation cannot measure these cross-session variables.