Can linear probing detect all the concepts a language model actually uses?

This explores the gap between what a linear probe can read off a model's internal representations and what the model actually relies on when it generates — two things the corpus treats as genuinely distinct.

Linear probing trains a linear readout on a model's hidden states to test whether a concept can be decoded from them. The corpus suggests this does not capture the concepts the model uses in practice, and for two opposite reasons that are easy to conflate. The first is that probing can find concepts the model doesn't use. Several studies converge on the finding that facts and features encoded in a model's representations frequently fail to causally influence what it outputs (see "Do language models actually use their encoded knowledge?"). A probe lights up, so the information is demonstrably 'in there', yet in these cases ablating it or steering along it leaves the downstream output essentially unchanged. Encoding and usage are separate processes, so a positive probe is evidence of presence, not of use.
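The presence-versus-use distinction above can be made concrete with a small synthetic sketch. Everything here is an assumption for illustration: the "hidden states" are random vectors, `u` is a hypothetical concept direction, and `toy_output` is a made-up linear readout constructed to ignore that direction. It is not a real language model and not a method taken from the cited notes. It only shows that a probe can score highly on a feature whose removal leaves the output untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 16

# Hypothetical toy "model": a binary concept is written into the hidden
# state along unit direction u, but the output readout ignores u entirely.
concept = rng.integers(0, 2, size=n)          # 0/1 labels
u = np.zeros(d)
u[0] = 1.0
hidden = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * concept - 1, u)

w_out = rng.normal(size=d)
w_out[0] = 0.0                                # readout is orthogonal to u

def toy_output(h):
    """Made-up downstream computation: a linear readout of the hidden state."""
    return h @ w_out

# 1) Linear probe: least-squares readout fit on one half, scored on the other.
X = np.hstack([hidden, np.ones((n, 1))])      # append a bias column
half = n // 2
w_probe, *_ = np.linalg.lstsq(X[:half], 2.0 * concept[:half] - 1.0, rcond=None)
probe_acc = np.mean((X[half:] @ w_probe > 0) == (concept[half:] == 1))

# 2) Causal check: project the concept direction out and compare outputs.
ablated = hidden - np.outer(hidden @ u, u)
max_output_shift = np.max(np.abs(toy_output(hidden) - toy_output(ablated)))

print(f"held-out probe accuracy: {probe_acc:.3f}")
print(f"max output change after ablation: {max_output_shift:.3g}")
```

The probe accuracy measures decodability; the ablation measures influence. In this constructed case the first is high and the second is zero by design, which is the pattern the studies above report finding in real models.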

The second reason cuts the other way: the model can use concepts a linear probe won't cleanly detect. Mechanistic interpretability work describes understanding as layered: some concepts live as simple linear directions of the kind a probe is built to catch, but higher-tier 'principled' understanding lives in compact multi-step circuits, and these coexist with cheap heuristics rather than replacing them (see "Do language models understand in fundamentally different ways?"). A linear probe is tuned for the directions tier; the circuit-level machinery and the heuristic shortcuts can be doing real work while sitting outside what a linear readout resolves. So the set of probe-detectable concepts and the set of used concepts overlap, but neither contains the other.
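The reverse case can be sketched the same way. The example below is a hedged toy, not a claim about how any real model computes: the hypothetical `toy_model` outputs the parity (product) of two binary features, a concept that drives every output but has no linear direction in the hidden state. The 30 noise dimensions and the least-squares probe are likewise illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical hidden state: two binary features (+1/-1) plus noise dimensions.
a = rng.choice([-1.0, 1.0], size=n)
b = rng.choice([-1.0, 1.0], size=n)
hidden = np.column_stack([a, b, rng.normal(size=(n, 30))])

def toy_model(h):
    """Made-up nonlinear 'circuit': the output is the parity of features 0 and 1."""
    return np.sign(h[:, 0] * h[:, 1])

parity = toy_model(hidden)                    # the concept the model actually uses

# Linear probe for that concept: fit on one half, score on the other.
X = np.hstack([hidden, np.ones((n, 1))])      # append a bias column
half = n // 2
w_probe, *_ = np.linalg.lstsq(X[:half], parity[:half], rcond=None)
probe_acc = np.mean(np.sign(X[half:] @ w_probe) == parity[half:])

# Causal check: flip feature 0 and count how often the output changes.
flipped = hidden.copy()
flipped[:, 0] *= -1.0
frac_changed = np.mean(toy_model(flipped) != parity)

print(f"held-out probe accuracy: {probe_acc:.3f}")
print(f"fraction of outputs changed by intervention: {frac_changed:.3f}")
```

Here the linear probe should sit near chance even though the intervention changes every output. A nonlinear probe would recover parity in this toy, so the sketch speaks only to what a linear readout resolves, which is the limit the paragraph above describes.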

This is the same encoding-versus-doing split that shows up across the collection's failure literature. Potemkin understanding is the behavioral version of it: a model can state a concept correctly, fail to apply it, and even flag its own failure, which suggests the 'knows it' signal and the 'uses it' signal run on disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). Reasoning traces tell a parallel story: the visible chain of thought reads like genuine concept-use, but corrupted and invalid traces perform almost as well as valid ones, so the trace is a persuasive surface, not a window onto the computation that actually produced the answer (see "Do reasoning traces show how models actually think?"). In each case, the thing you can observe (a probe hit, a fluent explanation, a clean trace) is decoupled from the thing you care about (causal use).

The sharper takeaway for someone curious about probing as an interpretability tool: detection and influence are different measurements, and a complete map of 'what concepts a model uses' needs causal intervention, not just decodability. The surprising part is that the corpus organizes these decouplings as a family of structurally distinct epistemic failure modes rather than one-off bugs (see "How do LLMs fail to know what they seem to understand?"). The limit on linear probing isn't a calibration problem you can probe your way past; it's that 'a concept is linearly readable' and 'a concept drives behavior' are answers to two different questions, and a model is full of cases where only one of them is true.

Sources (5 notes)

Do language models actually use their encoded knowledge?

Multiple studies confirm that language models can encode facts in their representations while those facts fail to causally affect downstream outputs. Encoding and usage are distinct processes.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.
