Do internal belief probes reveal what models actually know versus report?
This explores whether reading a model's internal representations (probes) tells us what it 'really knows' — and whether that internal knowledge matches what the model says out loud.
The short version from the corpus: there is a real gap between what a model encodes internally and what it reports, but it cuts in surprising directions, and 'what the model knows' turns out to be hard to pin down.
The sharpest finding is that encoding knowledge and using it are two different processes. Models can hold a fact in their representations while that fact never causally shapes the output (see: Do language models actually use their encoded knowledge?). So a probe that lights up doesn't prove the model 'knows' in any behaviorally meaningful sense; it may have detected a fact the model never acts on. This is a sharper instance of a general decoupling: models can hit identical accuracy with radically different internal structure, and circuits that look interpretable may not actually drive the answer (see: What actually happens inside the minds of language models?). Reading internals is necessary, but a probe's signal and the model's behavior are not the same variable.
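The encode-versus-use distinction can be made concrete with a toy example. The sketch below is not taken from any of the cited notes; it is a minimal synthetic illustration, assuming a hidden state that linearly encodes two binary facts and a hypothetical `toy_output` readout that consults only one of them. All names (`used`, `unused`, `toy_output`, `fit_probe`, `reflect`) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names hypothetical): two binary facts per example.
n, d = 2000, 16
used = rng.integers(0, 2, n)     # fact the toy model acts on
unused = rng.integers(0, 2, n)   # fact that is encoded but never read out

# Two orthonormal directions in hidden space, one per fact.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
dir_used, dir_unused = q[:, 0], q[:, 1]

# The hidden state encodes both facts (as +/-1 along each direction) plus noise.
hidden = (np.outer(2 * used - 1, dir_used)
          + np.outer(2 * unused - 1, dir_unused)
          + 0.3 * rng.normal(size=(n, d)))


def toy_output(h):
    """The toy model's 'report': it reads only the used direction."""
    return (h @ dir_used > 0).astype(int)


def fit_probe(h, y, steps=500, lr=0.5):
    """Logistic-regression probe fitted with plain gradient descent."""
    w, b = np.zeros(h.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
        w -= lr * h.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def reflect(h, direction):
    """Flip the component of each hidden state along one direction."""
    u = direction / np.linalg.norm(direction)
    return h - 2.0 * np.outer(h @ u, u)


# 1) The probe decodes the unused fact (accuracy measured on the training set).
w, b = fit_probe(hidden, unused)
probe_acc = float(np.mean(((hidden @ w + b) > 0).astype(int) == unused))

# 2) Intervening on the probed direction barely changes the output...
baseline = toy_output(hidden)
flips_probe_dir = float(np.mean(toy_output(reflect(hidden, w)) != baseline))

# 3) ...while intervening on the direction the output reads changes it.
flips_used_dir = float(np.mean(toy_output(reflect(hidden, dir_used)) != baseline))

print(f"probe accuracy on unused fact: {probe_acc:.3f}")
print(f"outputs flipped via probed direction: {flips_probe_dir:.3f}")
print(f"outputs flipped via used direction:   {flips_used_dir:.3f}")
```

In this toy the probe is highly accurate, yet editing the direction it found leaves the output almost untouched, while editing the direction the readout uses flips it. Real models are not this clean, but the sketch shows why probe accuracy alone is weak evidence of use: a causal intervention is what separates the two cases.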
That said, some internal states genuinely do steer behavior, which is what makes probing worth doing. Sparse autoencoders found a self-knowledge mechanism: models track whether they recognize an entity, and that signal causally steers whether they answer, refuse, or hallucinate (see: Do models know what they don't know?). Here the internal representation really does reveal something the model 'knows about its own knowledge,' and it shapes the report. So probes can reveal know-vs-report mismatches precisely because the know-signal sometimes drives the output and sometimes does not.
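What "causally steers" means here can be sketched with activation addition on a toy policy. This is an illustrative assumption, not the cited study's method or model: `known_dir` is a hypothetical stand-in for a recognition feature, and `toy_policy` is an invented rule that answers only when that feature is active.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Hypothetical unit direction standing in for a "known entity" feature.
known_dir = np.zeros(d)
known_dir[0] = 1.0


def toy_policy(h, threshold=0.0):
    """Answer when the recognition signal exceeds the threshold, else refuse."""
    return np.where(h @ known_dir > threshold, "answer", "refuse")


def steer(h, direction, alpha):
    """Activation addition: shift every hidden state by alpha * direction."""
    return h + alpha * direction


# Hidden states for unfamiliar entities: recognition signal centered at -2.
unknown_states = rng.normal(size=(500, d))
unknown_states[:, 0] -= 2.0

answer_rate_before = float(np.mean(toy_policy(unknown_states) == "answer"))
steered = steer(unknown_states, known_dir, alpha=4.0)
answer_rate_after = float(np.mean(toy_policy(steered) == "answer"))

print(f"answer rate on unknown entities, unsteered: {answer_rate_before:.2f}")
print(f"answer rate on unknown entities, steered:   {answer_rate_after:.2f}")
```

Pushing the recognition signal up makes the toy policy answer about entities it has no information on, which is the toy analogue of steering a model from refusal toward hallucination. The evidence for a causal role is the behavior change under intervention, not the fact that the feature can be read out.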
On the 'report' side, the news is humbling. When models describe their own states in words, those self-reports mostly echo training-data patterns rather than read off genuine internal processes; true introspection happens only in the narrow cases where a causal chain links the internal state to the verbal claim (see: Can language models actually introspect about their own states?). Reasoning traces are even worse as evidence: invalid logical steps perform almost as well as valid ones, meaning the visible 'thinking' is persuasive style, not a window into computation (see: Do reasoning traces show how models actually think?). This is the core case for probes over self-reports: the model's words about its own mind are unreliable, so you have to look at the machinery.
The twist worth taking away: probing isn't a neutral readout, because training shapes what's visible. Detection circuits that let a model notice internal perturbations are actively suppressed by safety training; one study reported perturbation detection dropping from 63.8% to 10.8% after safety training (see: How do language models detect injected steering vectors internally?). So the gap between what a model 'actually knows' internally and what it reports isn't just an architectural accident; it can be a learned policy. Internal probes reveal knowledge the report omits, but the same training that polishes the report can dim the very signals a probe relies on.
Sources (6 notes)
Multiple studies confirm that language models can encode facts in their representations while those facts fail to causally affect downstream outputs. Encoding and usage are distinct processes.
LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Traces with invalid logical steps yield nearly the same performance as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.