Why does probability of text completion not equal knowledge value?

This explores why an LLM scoring a piece of text as likely (high completion probability) is not the same as that text being true or knowledge-bearing — and what the corpus reveals about the gap.

The corpus is unusually direct about this: a model's job is to assign probability mass to continuations, and that mass tracks how often patterns appeared in training, not whether they are correct. The cleanest evidence is a shift-cipher study that decomposed chain-of-thought into three independent factors and found that output probability alone swings accuracy from 26% to 70%, with memorization separately tracking pre-training frequency (see "What three separate factors drive chain-of-thought performance?"). In other words, a large chunk of what looks like 'knowing the answer' is just the answer being a high-probability string. The same wedge shows up in how models prefer textually frequent paraphrases over semantically equivalent rare ones, across math, translation, and reasoning: they track statistical mass, not meaning (see "Do language models really understand meaning or just surface frequency?").
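To make "completion probability" concrete, here is a minimal sketch of how a continuation's score is just the sum of its per-token log-probabilities. The per-token numbers and the names `sequence_logprob`, `familiar_but_false`, and `rare_but_true` are invented for illustration; they are not measurements from any model or from the studies cited here.

```python
import math

def sequence_logprob(token_probs):
    """Log-probability of a whole continuation: the sum of per-token log-probs."""
    return sum(math.log(p) for p in token_probs)

# Made-up per-token probabilities, for illustration only.
familiar_but_false = [0.9, 0.8, 0.85]  # phrasing seen often in training, but wrong
rare_but_true = [0.9, 0.2, 0.3]        # correct phrasing that is rare in training text

scores = {
    "familiar_but_false": sequence_logprob(familiar_but_false),
    "rare_but_true": sequence_logprob(rare_but_true),
}

# Ranking by likelihood alone picks the familiar string, whatever its truth value.
preferred = max(scores, key=scores.get)
print(preferred, scores)
```

Nothing in the score refers to correctness: under these assumed numbers the familiar string wins purely because each of its tokens is more expected.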

The most pointed example of probability diverging from truth is attestation bias: models predict that a premise entails a hypothesis based on whether the hypothesis itself appeared in training, even when the premise is random and supports nothing (see "Do LLMs predict entailment based on what they memorized?"). The text is 'probable' because it's familiar, not because it's logically warranted. A related failure runs the other direction: when you put correct information in the context window, strong training priors can override it, so the model generates against the evidence in front of it (see "Why do language models ignore information in their context?"). Probability is anchored to what was seen often, and fresh true input can lose to it.
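The random-premise idea can be sketched as a small probe: pair unrelated premises with hypotheses and compare how often a predictor says "entails" for attested versus unattested hypotheses. This is only an illustration of the experiment's shape, not the original study's code. `entailment_rate`, `attestation_gap`, and the stub predictor are hypothetical names, and the stub stands in for a real model call.

```python
def entailment_rate(predict, premises, hypotheses):
    """Fraction of (premise, hypothesis) pairs that the predictor labels as entailment."""
    pairs = [(p, h) for p in premises for h in hypotheses]
    if not pairs:
        raise ValueError("need at least one premise and one hypothesis")
    return sum(1 for p, h in pairs if predict(p, h)) / len(pairs)

def attestation_gap(predict, random_premises, attested, unattested):
    """Entailment rate on attested hypotheses minus the rate on unattested ones.

    The premises are random, so a logically grounded predictor should score
    near zero on both; a large gap suggests it is reacting to familiarity.
    """
    return (entailment_rate(predict, random_premises, attested)
            - entailment_rate(predict, random_premises, unattested))

# Hypothetical stub: "entails" whenever the hypothesis is one it has seen before.
SEEN = {"Paris is in France"}

def stub_predict(premise, hypothesis):
    return hypothesis in SEEN

gap = attestation_gap(
    stub_predict,
    random_premises=["The cat sat on the mat", "Tuesday follows Monday"],
    attested=["Paris is in France"],
    unattested=["Zorbl is in Quenland"],
)
print(gap)
```

The stub ignores the premise entirely, which is the extreme case of the bias described above.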

This also explains a hard ceiling that surprises people: clever prompting can reorganize and surface what a model already absorbed, but it cannot inject knowledge that was never in the training distribution (see "Can prompt optimization teach models knowledge they lack?"). If completion probability equaled knowledge value, you could prompt your way to expertise the model never had. You can't, because probability is a map of the training data's terrain, not of the world. Even how knowledge gets laid down follows probability: whether a new fact 'primes' after training is predictable from the keyword's probability before training, with a threshold below which learning barely takes (see "Can we predict keyword priming before learning happens?").

The interesting twist is that probability isn't worthless as a signal; it just has to be calibrated and used carefully. A model's own token-probability uncertainty turns out to be a better trigger for when to go retrieve external evidence than elaborate heuristics (see "Can simple uncertainty estimates beat complex adaptive retrieval?"), and answer-span confidence can even be recycled as a reward to strengthen reasoning while restoring calibration (see "Can model confidence work as a reward signal for reasoning?"). The lesson isn't 'probability is meaningless'. It's that probability measures the model's relationship to its training, and knowledge value requires grounding that probability against something external. That's exactly why grounded systems that refuse to answer without supporting evidence outperform confident free generation on noisy sources (see "Can RAG systems refuse to answer without reliable evidence?").
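As a rough sketch of an uncertainty-triggered retrieval decision: compute an uncertainty score from the model's per-token distributions and retrieve only when it crosses a threshold. The notes summarized here do not specify the estimator, so mean token entropy, the `0.5` threshold, and the names `mean_token_entropy` and `should_retrieve` are all assumptions chosen for illustration. The example distributions are made up.

```python
import math

def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) across per-token probability distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)

def should_retrieve(token_distributions, threshold=0.5):
    """Retrieve external evidence when the model's own uncertainty is high.

    The threshold is an assumed value; in practice it would be tuned on held-out data.
    """
    return mean_token_entropy(token_distributions) > threshold

# Made-up distributions over three candidate tokens at each of two positions.
confident = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02]]
uncertain = [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]]

print(should_retrieve(confident), should_retrieve(uncertain))
```

The decision costs one pass over probabilities the model already produced, which is the appeal of this trigger over multi-call heuristics.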

The unsettling coda is that humans replicate the model's own confusion. In 24,000 real interactions, users trusted answers with more citations almost regardless of whether the citations were relevant (see "Do users trust citations more when there are simply more of them?"). The fluent, well-decorated, high-probability-looking answer reads as knowledge to us too, which is precisely why the gap between 'likely text' and 'true text' is worth naming out loud.

Sources 10 notes

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Do users trust citations more when there are simply more of them?

Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.
