Does causal intervention alone explain how neural mechanisms implement representations?

This explores whether poking at a model to see what changes (causal intervention) is enough to explain how neural networks build and use internal representations — or whether you also need to map what those representations *are* before the poking means anything.

The corpus's sharpest answer is no, and it's almost a definitional no: ablating a circuit, knocking out a neuron, and watching behavior shift is not enough by itself to explain how neural mechanisms implement representations. The clearest statement, in "Can we understand LLM mechanisms with only representational analysis?", frames the two methods as each other's missing half: representational analysis finds correlations without proving they cause anything, while causal analysis shows that perturbing a component changes behavior without telling you *what* that component was representing. A clean ablation result tells you that a part matters, not what the part is doing. Only the paired move (locate a candidate feature representationally, then verify it causally) produces a complete mechanistic claim.
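The paired move can be sketched on a toy model. This is a minimal illustration under stated assumptions, not a method taken from the notes: the hidden activations, the concept direction `true_dir`, the readout `w_out`, and every function name are hypothetical. A least-squares probe locates a candidate direction (the representational step), and projecting that direction out of the activations tests whether the output actually depends on it (the causal step). Ablating a random direction orthogonal to the probe serves as a control.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 8 hidden units, with a binary concept
# written along one hidden direction (unit 2).
n, d = 2000, 8
true_dir = np.zeros(d)
true_dir[2] = 1.0
concept = rng.integers(0, 2, size=n).astype(float)
hidden = rng.normal(size=(n, d)) + 3.0 * concept[:, None] * true_dir

# Hypothetical readout: the "behavior" is a linear function of the hidden state.
w_out = np.zeros(d)
w_out[2] = 1.0
w_out[5] = 0.5


def model_output(h):
    return h @ w_out


def concept_effect(h):
    """Difference in mean output between concept-on and concept-off inputs."""
    out = model_output(h)
    return out[concept == 1].mean() - out[concept == 0].mean()


def ablate(h, direction):
    """Project a unit-norm direction out of every activation vector."""
    return h - np.outer(h @ direction, direction)


# Step 1 (representational): a linear probe finds where the concept is decodable.
centered = hidden - hidden.mean(axis=0)
probe, *_ = np.linalg.lstsq(centered, concept - concept.mean(), rcond=None)
probe_dir = probe / np.linalg.norm(probe)

# Step 2 (causal): remove the candidate direction and re-measure behavior.
effect_before = concept_effect(hidden)
effect_after = concept_effect(ablate(hidden, probe_dir))

# Control: ablate a random direction orthogonal to the probe direction.
rand_dir = rng.normal(size=d)
rand_dir -= (rand_dir @ probe_dir) * probe_dir
rand_dir /= np.linalg.norm(rand_dir)
effect_control = concept_effect(ablate(hidden, rand_dir))

print(f"before: {effect_before:.2f}, "
      f"after probe ablation: {effect_after:.2f}, "
      f"after control ablation: {effect_control:.2f}")
```

In this toy setup the concept's effect on the output should collapse after ablating the probe direction and survive the control ablation. Either step alone would say less: the probe only shows the concept is decodable, and the ablation only shows that some direction matters.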

What makes this more than a methods footnote is how the corpus's actual interpretability successes lean on both legs at once. The work on interpretable-by-design networks ("Can sparse weight training make neural networks interpretable by design?") uses ablation to confirm circuits are 'necessary and sufficient' for task performance, but that claim is only legible because sparse training first made the neurons correspond to nameable concepts. Strip the representational story and the ablation just says 'this blob matters.' The same pattern shows up in the finding that networks decompose tasks into modular subnetworks ("Do neural networks naturally learn modular compositional structure?"): pruning experiments (a causal tool) reveal isolated subroutines, but the result is interesting precisely because each subnetwork maps to an identifiable function. Causal intervention is the verifier here, not the explanation.
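What an ablation-based 'necessary and sufficient' check can and cannot say is easy to see on a toy readout. The sketch below is illustrative only: the six hidden units, the weights `w`, the task, and the `accuracy` helper are assumptions invented for the example, not the setup of the sparse-training work. Necessity is tested by zero-ablating the candidate circuit, sufficiency by zero-ablating everything else.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy: 6 hidden units; the task label depends only on units 0 and 1.
n, d = 1000, 6
x = rng.normal(size=(n, d))
labels = (x[:, 0] + x[:, 1] > 0).astype(int)

# Hypothetical readout weights: large on units 0 and 1, small elsewhere.
w = np.array([1.0, 1.0, 0.05, -0.05, 0.02, 0.0])


def accuracy(mask):
    """Task accuracy when units with mask == 0 are zero-ablated."""
    preds = ((x * mask) @ w > 0).astype(int)
    return (preds == labels).mean()


circuit = np.array([1, 1, 0, 0, 0, 0])   # candidate circuit: units 0 and 1

full = accuracy(np.ones(d))              # intact model
without_circuit = accuracy(1 - circuit)  # necessity: ablate the circuit
circuit_only = accuracy(circuit)         # sufficiency: ablate everything else

print(f"full: {full:.3f}, "
      f"without circuit: {without_circuit:.3f}, "
      f"circuit only: {circuit_only:.3f}")
```

Accuracy near chance without the circuit and near the intact level with the circuit alone supports 'necessary and sufficient' for this toy. The numbers still say nothing about what units 0 and 1 encode; that reading has to come from the representational side.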

There's also a prior question that intervention can't reach: where do representations even come from, and why are they shaped the way they are? The observation that representational density is learned through data familiarity ("Is representational sparsity learned or intrinsic to neural networks?") is a developmental fact about training (dense activations for familiar inputs, sparse defaults for unfamiliar ones) that no amount of post-hoc ablation would surface. Intervention operates on a finished model; it's silent about the training dynamics that wrote the representations in the first place.

It's worth pulling in a parallel from a totally different corner: the critique that causal models alone can't capture human reasoning ("Can causal models alone capture how humans actually reason?"). There, causal belief networks handle causal inference well but miss associative links, analogical mappings, and emotion-driven belief shifts; they're a tractable starting point, not a complete theory. The structural echo is striking: in both human cognition and neural mechanism, a causal framework is genuinely powerful and genuinely partial. The thing it can't see isn't a bug to patch but a different kind of structure (relational, representational, developmental) that wants its own tools.

So the takeaway you didn't know you were after: 'causal' and 'mechanistic' aren't synonyms. Intervention is how you *test* a mechanistic hypothesis, but the hypothesis itself (what is being represented, and how it got there) comes from somewhere else. The working consensus across these notes is that mechanism lives in the handshake between the two, not in either one alone.

Sources (5 notes)

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.
