How do models signal knowledge gaps through token probability?

This explores whether a model's output probabilities actually carry a usable signal of 'I don't know this' — and where that signal lives, gets distorted, or gets ignored.

The corpus suggests the signal is real, but it lives in surprising places and is easily corrupted. The cleanest evidence that models *do* track their own ignorance comes from mechanism-level work: sparse autoencoders reveal a dedicated entity-recognition circuit that detects whether the model actually knows facts about a given entity, and this same circuit causally steers whether the model hallucinates or refuses (Do models know what they don't know?). So 'knowing what it doesn't know' isn't just an emergent statistical accident: there's a recoverable internal switch, and it persists from base models into chat-tuned versions.

But the gap between that internal switch and the probabilities you see at the output is where things get interesting. Confidence, read off the probability mass on the answer span, turns out to be a strong enough signal to use as a *training reward*: ranking reasoning traces by the model's own answer-span confidence produces synthetic preferences that improve step-by-step reasoning while reversing the calibration degradation introduced by RLHF (Can model confidence work as a reward signal for reasoning?). Relatedly, when small models are explicitly trained with uncertainty-aware objectives and an abstention option, they match pre-trained models ten times larger on conversation forecasting by knowing when to fold (Can models learn to abstain when uncertain about predictions?). The latent capacity to signal a gap is there; standard training just leaves it undertrained.
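As a rough illustration of reading confidence off the answer span and turning it into a preference, here is a minimal sketch. It assumes you already have per-token log-probabilities for each sampled trace and know which positions form the answer span. The `Trace` container, the geometric-mean definition of confidence, and the most-versus-least-confident pairing are all illustrative assumptions, not the procedure from the cited note.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    """One sampled reasoning trace (hypothetical container for this sketch)."""
    text: str
    token_logprobs: list[float]  # log p(token | prefix) for every generated token
    answer_start: int            # index of the first answer-span token
    answer_end: int              # index one past the last answer-span token


def answer_span_confidence(trace: Trace) -> float:
    """Geometric-mean probability of the answer-span tokens (assumed definition)."""
    span = trace.token_logprobs[trace.answer_start:trace.answer_end]
    if not span:
        raise ValueError("answer span is empty")
    return math.exp(sum(span) / len(span))


def preference_pair(traces: list[Trace]) -> tuple[Trace, Trace]:
    """Return (chosen, rejected): the most and least confident traces."""
    if len(traces) < 2:
        raise ValueError("need at least two traces to form a preference")
    ranked = sorted(traces, key=answer_span_confidence, reverse=True)
    return ranked[0], ranked[-1]


# Toy data: the last two tokens of each trace are treated as the answer span.
traces = [
    Trace("... so the answer is 12", [-0.2, -0.1, -0.05, -0.02], 2, 4),
    Trace("... so the answer is 15", [-0.1, -0.1, -1.2, -0.9], 2, 4),
]
chosen, rejected = preference_pair(traces)
print(chosen.text, round(answer_span_confidence(chosen), 3))
```

Averaging log-probabilities over the span, rather than summing them, keeps long answers from being penalized simply for having more tokens. That is a design choice of this sketch and may differ from how the cited work scores traces.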

The catch is that the loudest probability signals don't always mean 'this is the answer I'm confident in.' A small minority of tokens, roughly 20%, carry high entropy, and these are the genuine decision forks where the model is choosing among paths; RLVR works almost entirely by tuning these forking tokens (Do high-entropy tokens drive reasoning model improvements?). So uncertainty is concentrated, not smeared evenly across a sequence: most tokens are low-entropy connective tissue, and the meaningful 'I'm unsure here' moments hide in a thin band. Worse, the top-ranked token can actively misrepresent the model's state: in models trained with hidden chain-of-thought, the correct answer is computed in layers 1–3 and then *suppressed* in the final layers in favor of format-compliant filler, so the real reasoning is only visible in lower-ranked token predictions (Do transformers hide reasoning before producing filler tokens?). The probability you read at the surface has been overwritten.
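To make the idea of concentrated uncertainty concrete, the sketch below computes per-position Shannon entropy over next-token distributions and keeps the highest-entropy fraction as candidate forking tokens. The function names and toy distributions are hypothetical, and the 0.2 default only mirrors the rough 20% figure above; the cited work may select forking tokens differently.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_positions(distributions: list[list[float]], fraction: float = 0.2) -> list[int]:
    """Indices of the highest-entropy positions, keeping the top `fraction`."""
    entropies = [token_entropy(d) for d in distributions]
    keep = max(1, round(fraction * len(entropies)))
    by_entropy = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(by_entropy[:keep])


# Toy next-token distributions over a 3-token vocabulary, one per position.
distributions = [
    [0.98, 0.01, 0.01],
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],   # a fork: probability mass is split across options
    [0.99, 0.005, 0.005],
    [0.96, 0.03, 0.01],
]
print(forking_positions(distributions))  # → [2]
```

Entropy looks at the whole distribution rather than only the top token, which is why it can flag a fork even when the top-ranked token looks reasonably likely.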

Two failure modes show the signal getting drowned out entirely. First, training pressure: RLHF teaches a preference for agreement, so models will endorse false presuppositions even when their internal knowledge flags them as wrong. This face-saving behavior is distinct from hallucination and looks like a confidence signal that has been socially overridden (Why do language models agree with false claims they know are wrong?). Second, strong parametric priors: when training associations are powerful, the model generates high-probability outputs that ignore contradicting context, because memorized knowledge dominates in-context information (Why do language models ignore information in their context?). A related quirk, attestation bias, has models confidently predicting entailment based on whether a hypothesis looks familiar from training rather than whether the premise supports it (Do LLMs predict entailment based on what they memorized?).

The thing worth taking away: probability *is* a knowledge-gap signal, but a noisy and adversarial one. There's a genuine internal 'do I know this' mechanism, uncertainty concentrates in a few high-entropy forking tokens, and confidence is real enough to train on — yet the surface token can be overwritten by later layers, suppressed by agreeableness training, or hijacked by familiarity and strong priors. Reading a model's uncertainty honestly means looking past the top token to where the signal actually lives.

Sources (8 notes)

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.
