How does mechanistic interpretability complement learning mechanics in explaining deep learning?

This note explores how two ways of looking inside neural networks fit together: mechanistic interpretability (reverse-engineering the actual circuits a trained model uses) and learning mechanics (modeling how those structures form during training). It also explains why both are needed to account for why deep learning works.

The short version: interpretability tells you *what* structure a network ended up with, learning mechanics tells you *how and why* it got there, and the interesting findings live exactly where the two meet. Learning mechanics is consolidating into a unifying frame for deep learning theory, borrowing from classical and statistical mechanics by favoring average-case predictions, training dynamics, and aggregate statistics over worst-case bounds (see "Can deep learning theory unify around training dynamics?"). Mechanistic interpretability, meanwhile, has matured into a discipline with its own methodological standard. Representational analysis alone only finds correlations, and causal analysis alone shows behavioral effects without explaining them, so a real mechanistic claim requires both: locate a candidate feature representationally, then verify it causally by intervening (see "Can we understand LLM mechanisms with only representational analysis?").
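The locate-then-verify standard can be illustrated with a deliberately tiny example. The sketch below is a hand-built toy, not drawn from any of the cited notes: the two-unit "model" and the names `hidden`, `readout`, `pearson`, and `ablation_effect` are all assumptions made for illustration. Both hidden units correlate perfectly with the input feature, but only one of them is wired to the output, so correlation alone would flag both while intervention singles out the one that matters.

```python
import random

random.seed(0)

def hidden(x):
    # Toy "model": two hidden units that both track the same input feature.
    return [2.0 * x, -1.5 * x + 0.1]

def readout(h):
    # Only unit 0 is wired to the output; unit 1 is a correlated bystander.
    return 3.0 * h[0] + 0.0 * h[1]

def pearson(a, b):
    # Plain Pearson correlation coefficient between two equal-length lists.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

xs = [random.uniform(-1, 1) for _ in range(200)]
acts = [hidden(x) for x in xs]

# Step 1 (representational): how strongly does each unit track the feature?
corr = [abs(pearson([a[i] for a in acts], xs)) for i in range(2)]

def ablation_effect(unit):
    # Step 2 (causal): zero out one unit and measure the mean output change.
    total = 0.0
    for a in acts:
        patched = list(a)
        patched[unit] = 0.0
        total += abs(readout(a) - readout(patched))
    return total / len(acts)

effect = [ablation_effect(i) for i in range(2)]

print("correlation with feature:", corr)
print("mean output change when ablated:", effect)
```

In this toy, the representational step cannot tell the two units apart, while the ablation step shows that only unit 0 is causally involved. Real interpretability work uses far richer probes and interventions; zero-ablation is just the simplest stand-in.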

The complementarity becomes concrete around modularity. Interpretability work based on pruning experiments shows that networks decompose compositional tasks into isolated subnetworks, where ablating one piece breaks only its corresponding function. The crucial twist is a *learning* fact: pretraining substantially increases how consistently and reliably that modular structure forms (see "Do neural networks naturally learn modular compositional structure?"). You can even force the issue from the training side: sparse-weight training produces compact, human-readable circuits by construction, making interpretability a property you engineer through the learning process rather than recover afterward, although scaling this beyond tens of millions of parameters remains unsolved (see "Can sparse weight training make neural networks interpretable by design?"). The same interplay shows up in representation density: networks learn to fire densely for familiar data and stay sparse for unfamiliar inputs, a structural signature that emerges from pretraining exposure without task-specific fine-tuning (see "Is representational sparsity learned or intrinsic to neural networks?").
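The ablation test behind the modularity claim can be sketched in a few lines. The network below is hand-wired to be modular, which is an assumption for illustration only: in the cited work the subnetworks are discovered by pruning a trained model, not built by hand. The names `forward` and `errors`, and the two subtasks (doubling and negating), are hypothetical.

```python
def forward(x, mask=(1, 1, 1, 1)):
    # Hidden ReLU layer: units 0-1 serve subtask A, units 2-3 serve subtask B.
    # `mask` zeroes out hidden units to simulate ablation.
    pre = [x[0], -x[0], x[1], -x[1]]
    h = [m * max(0.0, p) for m, p in zip(mask, pre)]
    out_a = 2.0 * (h[0] - h[1])   # subtask A: double x[0]
    out_b = -(h[2] - h[3])        # subtask B: negate x[1]
    return out_a, out_b

def errors(mask):
    # Total absolute error on each subtask under a given ablation mask.
    inputs = [(-1.0, 0.5), (0.25, -2.0), (3.0, 1.5)]
    err_a = err_b = 0.0
    for x in inputs:
        a, b = forward(x, mask)
        err_a += abs(a - 2.0 * x[0])
        err_b += abs(b - (-x[1]))
    return err_a, err_b

intact = errors((1, 1, 1, 1))      # no ablation
without_a = errors((0, 0, 1, 1))   # ablate the subtask-A subnetwork
without_b = errors((1, 1, 0, 0))   # ablate the subtask-B subnetwork

print("intact:", intact)
print("A ablated:", without_a)
print("B ablated:", without_b)
```

The signature of modularity is the pattern of errors: removing one subnetwork damages only its own subtask and leaves the other untouched. In an entangled network, the same ablation would degrade both.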

The reason you genuinely *need* both is sharpest in the cases where behavior misleads. A cluster of work shows that two models with identical performance can have radically different internal organization: SGD-trained networks can reproduce outputs perfectly while harboring "fractured, entangled" representations that are exposed by weight perturbation and are vulnerable to distribution shift, damage that standard benchmarks do not detect (see "Can identical outputs hide broken internal representations?", "Can models be smart without organized internal structure?", and "Can AI pass every test while understanding nothing?"). A model can hold every linearly decodable feature a task needs and still be internally broken, and LLMs can reach the same output through different internal mechanisms (see "What actually happens inside a language model?"). Here interpretability supplies the alarm (the structure is wrong) and learning mechanics supplies the diagnosis (the training procedure, SGD, is what produced fracture rather than clean structure). Neither explanation is complete without the other.
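A minimal numerical illustration of "same outputs, different internals" follows. It is a hand-constructed toy and an assumption throughout: it does not reproduce the experiments in the cited notes, and it says nothing about how SGD actually produces fractured representations. It only shows that two parameterizations can agree on every tested input while reacting very differently to a small weight perturbation. The names `net`, `perturb_first`, and `max_shift` are hypothetical.

```python
def net(weights, x):
    # A linear "network": the output sums weight * input over parallel paths.
    return sum(w * x for w in weights)

clean = [1.0]                    # one path carries the whole function
entangled = [1001.0, -1000.0]    # two large paths that cancel to the same function

inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]

# Behavioral check: the two networks agree exactly on every test input.
same_outputs = all(net(clean, x) == net(entangled, x) for x in inputs)

def perturb_first(weights, rel=0.01):
    # Scale the first weight by (1 + rel), a 1% relative perturbation.
    out = list(weights)
    out[0] *= 1.0 + rel
    return out

def max_shift(weights):
    # Largest output change over the test inputs after the perturbation.
    noisy = perturb_first(weights)
    return max(abs(net(noisy, x) - net(weights, x)) for x in inputs)

print("identical outputs:", same_outputs)
print("shift, clean:", max_shift(clean))
print("shift, entangled:", max_shift(entangled))
```

A benchmark that only compares outputs would score these two networks identically. The perturbation probe separates them by roughly three orders of magnitude, which is the kind of structural difference that output-only evaluation cannot see.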

The payoff is a layered, non-binary picture of what "understanding" even means inside a model. Interpretability reveals three coexisting tiers: conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Higher tiers sit *on top of* lower-tier heuristics rather than replacing them, leaving a patchwork (see "Do language models understand in fundamentally different ways?"). The two lenses also converge on a striking claim about capability itself: base models already contain latent reasoning circuitry that five independent methods can elicit, so post-training *selects* structure rather than creating it (see "Do base models already contain hidden reasoning ability?"). That is the synthesis in one line: what looks like learning a new skill is often a learning-dynamics process surfacing a circuit that interpretability can already find. The practical takeaway: equal scores can hide unequal minds, and only by reading the circuits *and* the training history together can you tell a robust network from a lucky one.

Sources 11 notes

Can deep learning theory unify around training dynamics?

Research shows learning mechanics is consolidating as a unified frame for deep learning, modeled on classical and statistical mechanics. It prioritizes average-case predictions, training dynamics, and aggregate statistics over worst-case bounds, mirroring how physics addresses macroscopic systems.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure from evolved networks. Weight perturbations reveal fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
