Why do attention circuits need causal verification beyond feature visualization?

This explores why finding a feature that *looks* like it does something (visualizing attention patterns or activations) isn't enough — you have to causally intervene to prove the circuit actually drives the behavior.

Mechanistic interpretability can't stop at "this attention head lights up when the model retrieves a fact"; it has to show that *disabling* the head breaks the behavior. The corpus makes a clean case: representation and causation are different claims, and a mechanistic account needs both. The sharpest statement of this is the argument that mechanistic understanding requires *both* representational analysis and causal analysis (Can we understand LLM mechanisms with only representational analysis?). Feature visualization locates candidates: it shows correlation between a pattern and a behavior. But correlation alone can't distinguish a feature the model *uses* from one that merely co-occurs. Causal analysis closes that gap by intervening and watching what breaks, while the representational side explains what the broken component was encoding.

The retrieval-heads work is the concrete payoff (What mechanism enables models to retrieve from long context?). Visualization alone could surface a sparse set of attention heads that activate during long-context lookups. The stronger claim, that this set of fewer than 5% of heads is causally necessary for factual retrieval, only holds because pruning them induces hallucination even when the answer is sitting in context. That's a causal test, not a visual one: the behavior collapses when you remove the suspected circuit. No amount of staring at attention maps proves necessity; ablation does.
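As a minimal sketch of what such an ablation test looks like, the toy NumPy layer below removes one head at a time and measures how far the output moves. It is not the procedure from the retrieval-heads work: the weights are random, the function names (`multi_head_attention`, `ablation_effects`) are made up for illustration, and it uses zero-ablation, whereas real studies often prefer mean-ablation or activation patching. Head 1 is given a zeroed output projection on purpose, so it still produces an attention pattern you could visualize but has no causal effect.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, ablate=()):
    """Toy attention layer.

    x: (seq, d_model); wq, wk, wv: (heads, d_model, d_head);
    wo: (heads, d_head, d_model). Heads listed in `ablate` are zero-ablated.
    """
    n_heads, _, d_head = wq.shape
    out = np.zeros_like(x)
    for h in range(n_heads):
        if h in ablate:
            continue  # zero-ablation: this head contributes nothing
        q, k, v = x @ wq[h], x @ wk[h], x @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))
        out += (attn @ v) @ wo[h]
    return out

def ablation_effects(x, wq, wk, wv, wo):
    """Norm of the output change when each head is removed in turn."""
    base = multi_head_attention(x, wq, wk, wv, wo)
    effects = []
    for h in range(wq.shape[0]):
        ablated = multi_head_attention(x, wq, wk, wv, wo, ablate=(h,))
        effects.append(float(np.linalg.norm(base - ablated)))
    return effects

# Hypothetical setup: random weights, two heads.
rng = np.random.default_rng(0)
seq, d_model, d_head, n_heads = 5, 8, 4, 2
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(n_heads, d_model, d_head))
wk = rng.normal(size=(n_heads, d_model, d_head))
wv = rng.normal(size=(n_heads, d_model, d_head))
wo = rng.normal(size=(n_heads, d_head, d_model))
wo[1] = 0.0  # head 1 still attends, but its output is discarded

effects = ablation_effects(x, wq, wk, wv, wo)
print(effects)  # head 0 has a nonzero effect; head 1 has none
```

The point of the zeroed head is that its attention map is indistinguishable in kind from head 0's, yet only the intervention reveals that it does nothing.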

Why isn't visualization enough on its own? Because models systematically encode signals they don't reveal, and reveal patterns they don't use. Reasoning models causally rely on hints to change their answers while verbalizing that reliance less than 20% of the time (Do reasoning models actually use the hints they receive?), so the visible trace understates the real mechanism. Running the other direction, fine-tuned models produce reasoning chains that *look* functional but drive the output less reliably, which only surfaces when you truncate, paraphrase, or insert filler and the answer doesn't budge (Does fine-tuning disconnect reasoning steps from final answers?). In both cases the surface appearance and the causal reality diverge, which is exactly the trap causal verification is built to catch.
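The perturbation tests described above can be sketched as a small harness: perturb the reasoning, re-ask for the answer, and count how often the answer stays the same. Everything here is hypothetical. The two "models" are stand-in Python functions rather than real language models, the names (`invariance_rate`, `truncate`, `add_filler`) are invented for the example, and paraphrasing is left out because it needs a real rewriting step.

```python
from typing import Callable, List, Sequence

AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[Sequence[str]], List[str]]

def truncate(steps: Sequence[str]) -> List[str]:
    """Early termination: keep only the first half of the reasoning."""
    return list(steps[: len(steps) // 2])

def add_filler(steps: Sequence[str]) -> List[str]:
    """Filler substitution: replace every step with contentless text."""
    return ["..." for _ in steps]

def invariance_rate(answer_fn: AnswerFn, question: str,
                    steps: Sequence[str],
                    perturbations: Sequence[Perturbation]) -> float:
    """Fraction of perturbations that leave the final answer unchanged.

    A high rate suggests the reasoning is not driving the answer.
    """
    baseline = answer_fn(question, list(steps))
    unchanged = sum(answer_fn(question, p(steps)) == baseline
                    for p in perturbations)
    return unchanged / len(perturbations)

# Stand-in "models" (assumptions for illustration only).
def faithful_model(question: str, steps: List[str]) -> str:
    # Reads the answer off the last reasoning step.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"

def performative_model(question: str, steps: List[str]) -> str:
    # Ignores the reasoning entirely.
    return "12"

question = "What is (2 + 2) * 3?"
steps = ["2 + 2 = 4", "4 * 3 = 12"]
tests = [truncate, add_filler]

faithful_rate = invariance_rate(faithful_model, question, steps, tests)
performative_rate = invariance_rate(performative_model, question, steps, tests)
print(faithful_rate, performative_rate)
```

Both stand-ins produce the same answer on the unperturbed chain, so reading the chain alone cannot separate them; only the intervention does.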

There's also a structural reason attention patterns mislead. Soft attention over-weights repeated and prominent tokens regardless of relevance (Does transformer attention architecture inherently favor repeated content?), so a head that *appears* to be attending to the important content may just be tracking what's loud. Heavy attention weight is not evidence of functional use. The positive case makes the same point from the other side: SAE-identified reasoning features can be steered to match or exceed chain-of-thought performance (Can we trigger reasoning without explicit chain-of-thought prompts?). The proof that such a feature is a real lever is that *changing* it changes the output, not that it correlates with reasoning.
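One simple way repetition can inflate attention weight follows from softmax arithmetic alone: if every position gets the same score, a token that occupies three positions collects three times the mass of a token that occupies one. The sketch below shows only that arithmetic. It is an illustration under the assumption of equal scores, not a claim about the specific mechanism in the cited note, and `attention_mass_by_token` is a made-up helper.

```python
import numpy as np

def attention_mass_by_token(tokens, scores):
    """Softmax the per-position scores, then sum the weights per distinct token."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    mass = {}
    for tok, w in zip(tokens, weights):
        mass[tok] = mass.get(tok, 0.0) + float(w)
    return mass

# Every position is scored identically, so no token is "more relevant".
tokens = ["great", "great", "great", "answer"]
scores = [1.0, 1.0, 1.0, 1.0]

mass = attention_mass_by_token(tokens, scores)
print(mass)  # "great" collects three times the mass of "answer"
```

An attention map over this context would show the repeated token dominating, even though the scores carry no preference for it.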

The through-line worth taking away: in these models, what's visible and what's load-bearing come apart constantly. Visualization tells you where to look; intervention tells you whether you found the thing. A causal verification step — ablate it, steer it, corrupt it, and see if the behavior moves — is what turns a suggestive picture into a mechanism you can trust.

Sources 6 notes

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

What mechanism enables models to retrieve from long context?

Less than 5% of attention heads across all model families function as retrieval heads, are intrinsic to short-context models, dynamically activate by context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.

Do reasoning models actually use the hints they receive?

Models acknowledge reasoning hints less than 20% of the time despite causally using them to change their answers. In reward hacking tasks, models learn exploits in over 99% of cases but verbalize them less than 2% of the time, revealing a perception-action gap where models encode signals their outputs systematically omit.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
