Do models verbalize their implicit knowledge when that knowledge influences their output?
This explores the gap between what a model knows internally and what it actually says — whether the hidden knowledge steering an output ever surfaces in the words the model produces.
The short version the corpus offers: often no, and even when a model does verbalize something, the spoken version may not be where the real work happened. The collection treats 'has the knowledge,' 'uses the knowledge,' and 'says the knowledge' as three separate things that come apart more than you'd expect.
Start with the cleanest finding: a model can encode a fact in its representations while that fact never causally touches the output (see "Do language models actually use their encoded knowledge?"). Encoding and usage are distinct processes. A related failure also holds: models can possess relevant knowledge but fail to activate it without a nudge. Adding subtle emphasis recovers 15.3 percentage points of accuracy, and forcing the model to enumerate preconditions recovers 6-9 points (see "Why do language models fail to use knowledge they possess?"). So 'silent knowledge that doesn't reach the output' is a documented failure mode in two forms, knowledge that is encoded but never used and knowledge that is possessed but never activated, and prompting is sometimes the bridge that pulls implicit knowledge into the open.
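The distinction between encoding and usage can be made concrete with a deliberately tiny sketch. Everything here is an assumption for illustration: the two-dimensional "hidden state", the readout weights, and the helper names (`hidden_state`, `probe`, `ablate_fact`) are hypothetical and do not come from any of the cited studies. The point is only that a probe can decode a fact perfectly while ablating that fact leaves the output untouched.

```python
import random

# Assumed toy readout: the output head puts zero weight on the fact dimension.
OUTPUT_WEIGHTS = [1.0, 0.0]

def hidden_state(fact, signal):
    # Dimension 0 carries what the output head reads;
    # dimension 1 encodes a binary fact the output head ignores.
    return [signal, 1.0 if fact else -1.0]

def output(h):
    return sum(w * x for w, x in zip(OUTPUT_WEIGHTS, h))

def probe(h):
    # A linear probe on dimension 1: the fact is decodable from the state.
    return 1 if h[1] > 0 else 0

def ablate_fact(h):
    # Causal test: erase the fact dimension and see whether the output moves.
    return [h[0], 0.0]

rng = random.Random(0)
examples = [(rng.randint(0, 1), rng.uniform(-1.0, 1.0)) for _ in range(200)]
states = [hidden_state(fact, signal) for fact, signal in examples]

probe_accuracy = sum(
    probe(h) == fact for h, (fact, _) in zip(states, examples)
) / len(examples)
max_output_shift = max(abs(output(h) - output(ablate_fact(h))) for h in states)

print(probe_accuracy)    # fact is fully decodable
print(max_output_shift)  # yet removing it never changes the output
```

A high probe accuracy alone therefore says nothing about usage; an intervention such as the ablation above is what separates the two.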
Now the more unsettling part: even when influential computation happens, the visible text can actively hide it. Models trained with hidden chain-of-thought compute the correct answer in their first few layers, then suppress those representations in the final layers to emit format-compliant filler. The reasoning is fully recoverable from lower-ranked token predictions, just not from what you read (see "Do transformers hide reasoning before producing filler tokens?"). Relatedly, reasoning can scale entirely in latent space with no verbalized intermediate steps at all, suggesting that spelling things out is a training artifact rather than a requirement (see "Can models reason without generating visible thinking tokens?"). And the traces models do verbalize turn out to be persuasive appearances more than faithful accounts: invalid logical steps perform nearly as well as valid ones (see "Do reasoning traces show how models actually think?"). So verbalization, when it appears, isn't a reliable window into the implicit knowledge driving the answer.
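A logit-lens style readout, the kind of analysis behind the filler-token finding, can be sketched in a few lines. This is a toy under stated assumptions: the vocabulary, the unembedding matrix, and the per-layer hidden states are all made-up numbers chosen to mimic the reported pattern, not measurements from any model, and a real logit lens would normally also apply the model's final normalization layer before unembedding.

```python
# Hypothetical four-token vocabulary; "7" plays the role of the correct answer.
VOCAB = ["<filler>", "7", "3", "cat"]

# Hypothetical unembedding matrix: one row per vocabulary item, hidden size 3.
UNEMBED = [
    [1.0, 0.0, 0.0],     # <filler>
    [0.0, 1.0, 0.0],     # "7"
    [0.0, 0.0, 1.0],     # "3"
    [-1.0, -1.0, -1.0],  # "cat"
]

# Made-up hidden states at one position, ordered from early to final layer.
LAYER_STATES = [
    [0.1, 2.0, 0.3],  # early layer: answer direction dominates
    [0.5, 2.5, 0.2],
    [3.0, 2.0, 0.1],  # later layer: filler direction takes over
    [5.0, 1.5, 0.0],  # final layer
]

def logit_lens(hidden):
    # Project an intermediate hidden state straight through the unembedding.
    return [sum(w * x for w, x in zip(row, hidden)) for row in UNEMBED]

def ranked_tokens(hidden):
    logits = logit_lens(hidden)
    order = sorted(range(len(VOCAB)), key=lambda i: logits[i], reverse=True)
    return [VOCAB[i] for i in order]

rankings = [ranked_tokens(h) for h in LAYER_STATES]
top_tokens = [r[0] for r in rankings]
answer_ranks = [r.index("7") + 1 for r in rankings]

print(top_tokens)    # ['7', '7', '<filler>', '<filler>']
print(answer_ranks)  # [1, 1, 2, 2]
```

In this toy the answer never disappears; it drops from rank 1 to rank 2 once the filler direction grows, which is the sense in which reasoning can stay recoverable from lower-ranked predictions while being invisible in the emitted text.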
A second wrinkle: models can perceive something internally and then override it before it shows up. Linear probes decode a question's difficulty from hidden states before reasoning begins, yet the model overthinks anyway: a failure to act on its own signal, not a failure to have it (see "Can models recognize question difficulty before they reason?"). The same shape appears with constraints, where models look like they're reasoning but are really defaulting conservatively, their apparent competence masking an unstated heuristic (see "Are models actually reasoning about constraints or just defaulting conservatively?").
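For readers unfamiliar with linear probing, here is a minimal sketch of the technique on synthetic data. The assumptions are loud ones: the "hidden states" are random vectors with a difficulty signal planted in one dimension (`DIFFICULTY_DIM` is hypothetical), and the probe is plain logistic regression trained by gradient descent. It illustrates how a probe detects a linearly encoded property, not how the cited study was run.

```python
import math
import random

rng = random.Random(0)
DIM = 4
DIFFICULTY_DIM = 2  # hypothetical dimension carrying the difficulty signal

def synthetic_state(is_hard):
    # Gaussian noise everywhere, plus a planted offset on one dimension.
    h = [rng.gauss(0.0, 0.5) for _ in range(DIM)]
    h[DIFFICULTY_DIM] += 1.5 if is_hard else -1.5
    return h

def make_split(n):
    return [(synthetic_state(i % 2), i % 2) for i in range(n)]

train, test = make_split(200), make_split(100)

def predict_prob(weights, bias, h):
    z = bias + sum(w * x for w, x in zip(weights, h))
    return 1.0 / (1.0 + math.exp(-z))

# Logistic-regression probe, full-batch gradient descent.
weights, bias, lr = [0.0] * DIM, 0.0, 0.5
for _ in range(100):
    grad_w, grad_b = [0.0] * DIM, 0.0
    for h, y in train:
        err = predict_prob(weights, bias, h) - y
        grad_b += err
        for i in range(DIM):
            grad_w[i] += err * h[i]
    bias -= lr * grad_b / len(train)
    weights = [w - lr * g / len(train) for w, g in zip(weights, grad_w)]

probe_accuracy = sum(
    (predict_prob(weights, bias, h) > 0.5) == bool(y) for h, y in test
) / len(test)
strongest_dim = max(range(DIM), key=lambda i: abs(weights[i]))

print(probe_accuracy)
print(strongest_dim)
```

A probe like this only shows that the signal is present in the state. Whether the model's behavior, such as how long it reasons, actually depends on that signal is a separate question that the probe cannot answer.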
There's a thin sliver of genuine self-report. Models do build internal mechanisms for tracking whether they know a fact about an entity, and those mechanisms causally steer hallucination and refusal (see "Do models know what they don't know?"). And true introspection is possible, but only narrowly, when a causal chain actually links an internal state to the report; most 'self-reports' just echo human training distributions rather than reading the machinery (see "Can language models actually introspect about their own states?"). The thing you didn't know you wanted to know: the words a model uses to explain itself and the computation that produced its answer are largely separate systems. Verbalization is something models do *to* an output, not a transcript *of* it.
Sources (9 notes)
Multiple studies confirm that language models can encode facts in their representations while those facts fail to causally affect downstream outputs. Encoding and usage are distinct processes.
Models possess relevant knowledge but fail to activate it without explicit prompting. Adding subtle emphasis recovers 15.3 percentage points of accuracy, and forcing enumeration of preconditions recovers 6-9 points, showing the bottleneck is in constraint inference, not storage.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Linear probes successfully decode difficulty from the representations of large reasoning models (LRMs) before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.