Do language models actually use their encoded knowledge?
Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
This is one of the more precise and counterintuitive findings in LLM interpretability: a knowledge probe can confirm that a fact is encoded in the model's internal representations — it can be extracted by a linear classifier — while that same fact fails to causally influence downstream generation.
The REMEDI paper is explicit: "even when an LM encodes information in its representations, this information may not causally influence subsequent generation." This has been independently documented by Ravfogel et al. (2020), Elazar et al. (2021), and Ravichander et al. (2021).
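To make the distinction concrete, here is a minimal probing sketch of the kind such studies rely on (the model, layer, prompts, and labels below are illustrative assumptions, not taken from REMEDI or the cited papers): fit a linear classifier on one layer's hidden states and check whether a property is decodable. High probe accuracy establishes encoding only; it says nothing about whether that information feeds into generation.

```python
# Minimal probing sketch (illustrative setup): train a linear classifier on
# hidden states from one layer and test whether a toy property is linearly
# decodable. Decodability alone does not show the property is used downstream.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

# Toy dataset: prompts paired with a binary label for the probed property
# (here, "the prompt is about a capital city" -- purely illustrative).
prompts = ["Paris is the capital of", "Berlin is the capital of",
           "The sun rises in the", "Water freezes at"]
labels = [1, 1, 0, 0]

LAYER = 6  # which hidden layer to probe; an arbitrary choice for the sketch
features = []
with torch.no_grad():
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"))
        # Use the last-token representation at the chosen layer as probe input.
        features.append(out.hidden_states[LAYER][0, -1].numpy())

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe accuracy (train):", probe.score(features, labels))
```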
The mechanism: LM representations are computed as part of the forward pass, but which aspects of those representations actually influence token generation at the output depends on attention patterns and downstream computation. A fact can be "stored" in a representation without that storage lying on the causal path to the output.
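A sketch of how that causal-path claim can be tested directly (an assumed procedure for illustration, not any specific paper's method): remove a candidate fact direction from one layer's residual stream during the forward pass and compare next-token logits with and without the intervention. The model, layer, and direction below are placeholders.

```python
# Causal-path test sketch (assumed setup): project a candidate "fact direction"
# out of one layer's residual stream and compare next-token logits with and
# without the intervention. An unchanged distribution suggests the encoded
# direction does not causally influence generation from this prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6                                        # assumed intervention layer
direction = torch.randn(model.config.n_embd)     # stand-in for a probe-derived direction
direction = direction / direction.norm()

def project_out(module, inputs, output):
    hidden = output[0]                           # GPT-2 block output: (hidden_states, ...)
    coeff = hidden @ direction                   # component along the candidate direction
    hidden = hidden - coeff.unsqueeze(-1) * direction
    return (hidden,) + output[1:]

prompt = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    base_logits = model(**prompt).logits[0, -1]
    handle = model.transformer.h[LAYER].register_forward_hook(project_out)
    ablated_logits = model(**prompt).logits[0, -1]
    handle.remove()

print("max logit change:", (base_logits - ablated_logits).abs().max().item())
```

If the logits barely move, the fact is encoded off the causal path for this prompt; if they shift, the direction is behaviorally live.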
This breaks a common assumption in interpretability and evaluation: that probing success implies behavioral relevance. If you can decode that the model "knows" something, you might assume it will generate outputs consistent with that knowledge. But this assumption is empirically false: the model may encode a fact and still fail to use it.
The practical consequence for REMEDI: effective knowledge editing requires finding causal directions — representations that, when modified, actually change the output. Simply finding where knowledge is encoded is not sufficient. This is why REMEDI adds edited fact vectors to specific layers at specific tokens, not just anywhere in the representation.
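A minimal sketch of that kind of targeted intervention, under stated assumptions (a GPT-2-style model and a random stand-in for the edit vector; REMEDI learns its fact vectors rather than sampling them): the vector is added at one layer and one token position only, so the edit sits on the path from that token's representation to the output.

```python
# REMEDI-style edit sketch (hypothetical hook, not the paper's implementation):
# add an "edited fact" vector to the hidden state at one layer, but only at a
# chosen token position (e.g. the subject token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, TOKEN_IDX = 8, 2                            # assumed edit layer and subject-token index
edit_vec = 0.1 * torch.randn(model.config.n_embd)  # stand-in for a learned fact vector

def add_edit(module, inputs, output):
    hidden = output[0]
    if hidden.size(1) > TOKEN_IDX:                 # skip cached single-token decode steps
        hidden = hidden.clone()
        hidden[:, TOKEN_IDX, :] += edit_vec        # edit only the chosen token position
        return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_edit)
ids = tok("The Space Needle is in the city of", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=3, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```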
For interpretability broadly: probing success establishes encoding, which is necessary but not sufficient evidence of behavioral relevance. Encoding ≠ generation.
Mechanistic interventions that close the gap: Two mechanistic interpretability approaches directly address this encoding-generation dissociation. Inference-Time Intervention (ITI) identifies a subset of attention heads where "truthful" directions can be extracted, then shifts activations along those directions at inference time — improving LLaMA truthfulness from 32.5% to 65.1% on TruthfulQA. The key insight: the model "knows" more than it "says," and the gap can be partially closed by targeting specific attention heads rather than the full representation. Eliciting Latent Knowledge (ELK) confirms this from a different angle: linear probes in middle layers can report a model's knowledge independently of what the model outputs, even when the model has been finetuned to produce systematically untruthful responses. Together, ITI and ELK demonstrate that the encoding-generation gap is not absolute — it can be bridged through targeted intervention on the causal pathways between encoded knowledge and generation.
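A simplified sketch of the ITI idea, with several assumptions flagged (GPT-2 layout, random stand-in directions instead of probe-derived "truthful" ones, and an arbitrary steering strength): selected attention heads' outputs are shifted along a direction just before the attention output projection, leaving the rest of the computation untouched.

```python
# ITI-flavoured steering sketch (simplified; directions and strength are stand-ins).
# Per-head outputs are still laid out contiguously at the input of attn.c_proj in
# GPT-2, so a forward pre-hook there can shift individual heads.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

n_head = model.config.n_head
head_dim = model.config.n_embd // n_head
ALPHA = 5.0                                      # steering strength (assumed)
# (layer, head) -> unit direction; stand-ins for probe-derived truthful directions
targets = {(10, 3): torch.randn(head_dim), (11, 7): torch.randn(head_dim)}
targets = {k: v / v.norm() for k, v in targets.items()}

def make_hook(layer):
    def pre_hook(module, args):
        x = args[0].clone()                      # (batch, seq, n_embd), heads concatenated
        for (l, h), d in targets.items():
            if l == layer:
                x[..., h * head_dim:(h + 1) * head_dim] += ALPHA * d
        return (x,)
    return pre_hook

handles = [model.transformer.h[l].attn.c_proj.register_forward_pre_hook(make_hook(l))
           for l in {l for (l, _) in targets}]
ids = tok("Q: Is the Earth flat? A:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=8, do_sample=False)
for h in handles:
    h.remove()
print(tok.decode(out[0]))
```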
Source: Discourses; enriched from MechInterp
Related concepts in this collection
- Why do language models ignore information in their context?
  Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
  Relation: a specific case where encoding (the contextual information) fails to influence generation.
- Do classical knowledge definitions apply to AI systems?
  Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?
  Relation: this finding further complicates what "knowledge" means in LLMs.
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  Relation: mechanistic substrate; layer localization explains the encoding-generation gap, since lower-layer knowledge may fail to causally influence generation when higher-layer reasoning adjustment overrides or misapplies it.
- Can a model be truthful without actually being honest?
  Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
  Relation: the RepE framework provides the theoretical basis: truthfulness (output matches facts) and honesty (output matches beliefs) are separable, and the encoding-generation gap is one mechanism that produces their divergence.
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  Relation: ITI and ELK are both RepE-style interventions that work at the representation level rather than the circuit level.
- Do personality traits activate hidden emoji patterns in language models?
  When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.
  Relation: a positive counterexample: personality-associated emoji patterns are encoded latently during pre-training and DO causally emerge through fine-tuning, demonstrating that the encoding-generation gap can be closed by targeted parameter-efficient activation of specific neurons.
- Why do language models fail to act on their own reasoning?
  LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing?
  Relation: a quantified instance: the 87% correct rationales vs. 64% correct actions figure shows the encoding-generation gap in action selection; the reasoning trace is generated through one pathway while action selection draws on shallower habitual computations, and RL fine-tuning partially closes the gap.
- Can we trigger reasoning without explicit chain-of-thought prompts?
  This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
  Relation: a positive counterexample for reasoning: SAE-identified reasoning features ARE on the causal path; steering one feature activates reasoning across 6 model families, demonstrating that for reasoning specifically the encoding-generation gap can be fully closed.
Original note title: information encoded in LM representations may not causally influence generation