How does linguistic calibration differ from token probability calibration?

This explores the gap between two senses of 'calibration': whether a model's raw token probabilities track correctness, versus whether its uncertainty is expressed at the level of meaning and dialogue — and the corpus shows these can diverge sharply.

This question is really about *where* you measure a model's confidence. Token probability calibration asks a narrow, mechanical thing: when the model assigns a token (or answer span) a probability of 0.8, is it right about 80% of the time? Linguistic calibration asks something broader — does the uncertainty the model conveys in *meaning* and in *conversation* match what it actually knows? The corpus is interesting precisely because it shows these two layers come apart.
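To make the token-level question concrete, here is a minimal sketch of expected calibration error (ECE), one common way to check whether stated probabilities match hit rates. The function name, the ten equal-width bins, and the input format are illustrative assumptions, not something taken from the notes below.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean |avg confidence - accuracy| over bins.

    `confidences` are the model's probabilities for its chosen answers,
    `correct` says whether each answer was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.8 and is right four times out of five scores close to zero here. A model that says 0.9 and is always wrong scores about 0.9.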

The cleanest demonstration is semantic entropy (Can we detect when language models confabulate?). A model can be perfectly fluent and token-confident while confabulating, because the same false claim can be phrased many ways, and each individual phrasing looks probable. Only when you sample multiple answers and cluster them by *meaning* (does answer A entail answer B, and B entail A?) does the real uncertainty surface. That's the heart of the distinction: token-level calibration is blind to confabulations that meaning-level calibration catches. The unit of measurement changes the answer.
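The sample-then-cluster idea can be sketched in a few lines. This is a simplified version under stated assumptions: it estimates cluster probabilities from sample counts rather than sequence likelihoods, and `entails` is a caller-supplied function (in practice an NLI model or an LLM judge). `TOY_MEANING` and `toy_entails` are hypothetical stand-ins so the example runs on its own.

```python
import math
from typing import Callable, List, Sequence

Entails = Callable[[str, str], bool]


def cluster_by_meaning(answers: Sequence[str], entails: Entails) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose
    representative it entails AND is entailed by (bidirectional check)."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(answer, rep) and entails(rep, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: Sequence[str], entails: Entails) -> float:
    """Entropy (in nats) over meaning clusters, using sample frequencies."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    clusters = cluster_by_meaning(answers, entails)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


# Hypothetical toy entailment: two strings entail each other when they
# map to the same canonical meaning. A real system would call a model.
TOY_MEANING = {
    "Paris": "paris",
    "It is Paris.": "paris",
    "The capital is Paris": "paris",
    "Lyon": "lyon",
    "Marseille": "marseille",
}


def toy_entails(a: str, b: str) -> bool:
    return TOY_MEANING[a] == TOY_MEANING[b]
```

Three differently worded answers that all mean "Paris" collapse into one cluster and give zero entropy. Three answers naming three different cities give entropy log 3, however confident each one sounded on its own.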

That said, token-probability calibration is not a poor cousin: it's often shockingly useful. At deciding *when* a model should go look something up, calibrated token uncertainty beats elaborate adaptive-retrieval heuristics on single-hop tasks and matches them on multi-hop ones, using a fraction of the LM and retriever calls (Can simple uncertainty estimates beat complex adaptive retrieval?). Answer-span confidence can even be recycled as a training reward (RLSF) that simultaneously sharpens reasoning and *restores* the calibration that RLHF tends to erode (Can model confidence work as a reward signal for reasoning?). So token probabilities carry real signal. The catch is that fine-tuning for human preference degrades it, which is exactly why people reach for meaning-level measures as a check.
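As a rough illustration of uncertainty-gated retrieval, the sketch below turns per-token log-probabilities into one confidence score and retrieves only when that score is low. The geometric-mean aggregation, the 0.6 threshold, and both function names are assumptions made for the example. The cited note does not specify them.

```python
import math
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities (exp of the mean log-prob)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.6) -> bool:
    """Retrieve only when the model's own confidence falls below the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data where the confidence scores are calibrated.
    """
    return answer_confidence(token_logprobs) < threshold
```

The appeal of this design is cost: the decision reuses log-probabilities from a generation the model has already produced, so it adds no extra LM or retriever calls.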

There's a third layer the question quietly points at: calibration as a *conversational act*. Humans calibrate by building common ground: asking clarifying questions, repairing misunderstandings mid-dialogue. LLMs largely skip this, operating in 'static grounding' mode where they answer immediately rather than negotiating what was meant (Why do language models skip the calibration step?). This is linguistic calibration in its richest sense: not just *expressing* uncertainty accurately, but *acting* on it by slowing down. Speech and dialogue systems learned this lesson long ago. With 15–30% recognition error rates in noisy environments, they had to maintain belief distributions over user intent rather than commit to one reading (Why do dialogue systems need probabilistic reasoning?).
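Maintaining a belief over user intent, and acting on uncertainty by asking rather than guessing, can be sketched as a Bayesian update plus a commit threshold. This is only the belief-tracking piece, not a full POMDP (there is no transition model or learned policy), and the intent names, likelihood values, and 0.8 threshold are hypothetical.

```python
from typing import Dict


def update_belief(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes rule: posterior(intent) is proportional to
    prior(intent) * P(observation | intent)."""
    unnormalised = {i: prior[i] * likelihood.get(i, 0.0) for i in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation is impossible under every intent")
    return {i: v / total for i, v in unnormalised.items()}


def next_action(belief: Dict[str, float], commit_threshold: float = 0.8) -> str:
    """Commit only when one intent is likely enough; otherwise ask."""
    intent, prob = max(belief.items(), key=lambda kv: kv[1])
    if prob >= commit_threshold:
        return f"act:{intent}"
    return "ask_clarifying_question"


# Two noisy observations in a row, each mildly to strongly favouring a flight.
belief = {"book_flight": 0.5, "book_hotel": 0.5}
belief = update_belief(belief, {"book_flight": 0.6, "book_hotel": 0.4})
first_move = next_action(belief)   # still ambiguous, so ask
belief = update_belief(belief, {"book_flight": 0.9, "book_hotel": 0.1})
second_move = next_action(belief)  # evidence has accumulated, so act
```

After the first noisy observation the system asks a clarifying question. After the second it has enough evidence to act, which is the 'dynamic grounding' behaviour that an immediate answer skips.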

Why do the two layers diverge at all? Because the model's surface output is a sample, not a commitment. A model holds a superposition of plausible continuations and samples one at generation time: regenerate and you get a different, equally confident-sounding answer (Do large language models actually commit to a single character?). Token probability calibrates the *sampling distribution*; linguistic calibration tries to calibrate the *thing being claimed*. The reader's takeaway: a model that sounds well-calibrated word-by-word can still be badly calibrated about what it means, and the only way to see that is to stop reading tokens and start comparing meanings.

Sources (6 notes)

Can we detect when language models confabulate?

Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Why do language models skip the calibration step?

LLMs operate in static grounding mode—retrieving data and responding without clarification loops. Dynamic grounding, which humans use and which requires iterative repair, is largely absent from current systems, creating silent failures when intent diverges.

Why do dialogue systems need probabilistic reasoning?

Real-world speech recognition achieves 15-30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
