Language Understanding and Pragmatics · LLM Reasoning and Architecture

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This note explores the gap between knowing and doing.

Note · 2026-02-21 · sourced from Discourses
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

This is one of the more precise and counterintuitive findings in LLM interpretability: a knowledge probe can confirm that a fact is encoded in the model's internal representations — it can be extracted by a linear classifier — while that same fact fails to causally influence downstream generation.

The REMEDI paper is explicit: "even when an LM encodes information in its representations, this information may not causally influence subsequent generation." This has been independently documented by Ravfogel et al. (2020), Elazar et al. (2021), and Ravichander et al. (2021).
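To make the probing side concrete, here is a minimal sketch of a linear knowledge probe, assuming GPT-2 via Hugging Face transformers, an arbitrary layer choice, and a toy labeled dataset; none of these details come from the REMEDI setup, they only illustrate what "extracted by a linear classifier" means.

```python
# Minimal probing sketch (illustrative model, layer, and data):
# train a linear classifier on hidden states to test whether a toy
# binary property is linearly decodable from the representations.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

# Toy "fact" dataset: is the subject a country? (labels are illustrative)
texts = ["France is", "Canada is", "Einstein is", "Mozart is"]
labels = [1, 1, 0, 0]

LAYER = 6  # which hidden layer to probe; a free choice in practice
feats = []
with torch.no_grad():
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        # representation of the last token at the chosen layer
        feats.append(out.hidden_states[LAYER][0, -1].numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("probe accuracy on training data:", probe.score(feats, labels))
```

High probe accuracy here only shows the information is present in the representation; it says nothing yet about whether the model uses it.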

The mechanism: LM representations are computed as part of the forward pass, but which aspects of those representations actually influence the final token prediction depends on attention patterns and downstream computation. A fact can be "stored" in a representation without that storage lying on the causal path to the output.
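The causal side can be tested by intervening mid-forward-pass. A hedged sketch, assuming GPT-2, an arbitrary layer, and a random stand-in for a learned probe direction: project the direction out of the hidden state and check whether generation changes. If generation is unchanged, the encoded information was off the causal path.

```python
# Sketch of a causal test: erase a direction from a layer's hidden state
# via a forward hook and compare generations with and without the edit.
# The direction `w` is a random placeholder for a learned probe direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

w = torch.randn(768)          # stand-in for the probe direction (GPT-2 hidden size)
w = w / w.norm()

def erase_direction(module, inputs, output):
    hidden = output[0]                      # (batch, seq, hidden)
    proj = (hidden @ w).unsqueeze(-1) * w   # component along the direction
    return (hidden - proj,) + output[1:]    # remove it, keep the rest of the output

prompt = tok("The capital of France is", return_tensors="pt")

with torch.no_grad():
    baseline = lm.generate(**prompt, max_new_tokens=5, do_sample=False)

handle = lm.transformer.h[6].register_forward_hook(erase_direction)
with torch.no_grad():
    ablated = lm.generate(**prompt, max_new_tokens=5, do_sample=False)
handle.remove()

print("baseline:", tok.decode(baseline[0]))
print("ablated :", tok.decode(ablated[0]))
```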

This breaks a common assumption in interpretability and evaluation: that probing success implies behavioral relevance. If you can decode that the model "knows" something, you might assume it will generate outputs consistent with that knowledge. But this assumption is empirically false. The model may encode and fail to use.

The practical consequence for REMEDI: effective knowledge editing requires finding causal directions — representations that, when modified, actually change the output. Simply finding where knowledge is encoded is not sufficient. This is why REMEDI adds edited fact vectors to specific layers at specific tokens, not just anywhere in the representation.
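This is not REMEDI's learned editor, but a minimal sketch of the shape of the intervention the note describes: add a fact vector to the hidden state at one specific layer and one specific token position via a forward hook, then let the rest of the forward pass run. The layer, position, and vector below are illustrative placeholders, not values from the paper.

```python
# Sketch of a localized representation edit: inject a vector at one layer
# and one token position, so the edit sits on the causal path to generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

EDIT_LAYER = 8
EDIT_POS = 1                      # token position of the entity (illustrative)
fact_vector = torch.randn(768)    # stand-in for a learned edited-fact vector

def add_fact(module, inputs, output):
    hidden = output[0].clone()            # (batch, seq, hidden)
    if hidden.shape[1] > EDIT_POS:        # edit the prompt pass only, not cached steps
        hidden[:, EDIT_POS] += fact_vector
    return (hidden,) + output[1:]

prompt = tok("Paris is located in", return_tensors="pt")
handle = lm.transformer.h[EDIT_LAYER].register_forward_hook(add_fact)
with torch.no_grad():
    edited = lm.generate(**prompt, max_new_tokens=5, do_sample=False)
handle.remove()
print(tok.decode(edited[0]))
```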

For interpretability broadly: probing success is necessary but not sufficient evidence of behavioral relevance. Encoding ≠ generation.

Mechanistic interventions that close the gap: two mechanistic interpretability approaches directly address this encoding-generation dissociation.

Inference-Time Intervention (ITI) identifies a subset of attention heads where "truthful" directions can be extracted, then shifts activations along those directions at inference time, improving LLaMA truthfulness from 32.5% to 65.1% on TruthfulQA. The key insight: the model "knows" more than it "says," and the gap can be partially closed by targeting specific attention heads rather than the full representation.

Eliciting Latent Knowledge (ELK) confirms this from a different angle: linear probes in middle layers can report a model's knowledge independently of what the model outputs, even when the model has been finetuned to produce systematically untruthful responses.

Together, ITI and ELK demonstrate that the encoding-generation gap is not absolute: it can be bridged through targeted intervention on the causal pathways between encoded knowledge and generation.
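A simplified ITI-style sketch, with GPT-2 standing in for LLaMA and random vectors standing in for learned "truthful" directions (the real method selects heads by probe accuracy and scales shifts by activation statistics, none of which is reproduced here): shift the output of chosen attention heads, just before the attention output projection, along a fixed direction at inference time.

```python
# Simplified ITI-style intervention: add a direction to the outputs of
# selected attention heads (pre-projection) during generation.
# Head choices, directions, and strength are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

HEAD_DIM = 64                              # GPT-2: 12 heads x 64 dims
# (layer, head) -> direction to add along; random stand-ins here
interventions = {(9, 3): torch.randn(HEAD_DIM), (10, 7): torch.randn(HEAD_DIM)}
ALPHA = 5.0                                # intervention strength

def make_hook(head, direction):
    d = direction / direction.norm()
    def pre_hook(module, inputs):
        # inputs[0]: concatenated per-head outputs, just before the out-projection
        x = inputs[0].clone()
        lo, hi = head * HEAD_DIM, (head + 1) * HEAD_DIM
        x[..., lo:hi] += ALPHA * d         # shift this head along its direction
        return (x,) + inputs[1:]
    return pre_hook

handles = [
    lm.transformer.h[layer].attn.c_proj.register_forward_pre_hook(make_hook(head, d))
    for (layer, head), d in interventions.items()
]

prompt = tok("Q: Does lightning never strike the same place twice? A:", return_tensors="pt")
with torch.no_grad():
    out = lm.generate(**prompt, max_new_tokens=10, do_sample=False)
for h in handles:
    h.remove()
print(tok.decode(out[0]))
```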


Source: Discourses; enriched from MechInterp

Original note title

information encoded in lm representations may not causally influence generation