LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Note · 2026-02-23 · sourced from MechInterp

Existing mechanistic interpretability approaches (SAEs, activation patching, circuit discovery) attempt to understand dense models post-hoc. Weight-sparse training offers a fundamentally different paradigm: make the model interpretable by construction.

The approach: constrain most weights to zero (small L0 norm). Each neuron can then read from or write to only a few residual channels, which discourages distributing representations across channels and recruiting excess neurons. The result: disentangled circuits in which neuron activations correspond to simple concepts ("tokens following a single quote," "depth of list nesting") connected in straightforward, intuitive ways.
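
A minimal sketch of one way to impose such a constraint: after each optimizer step, project every weight matrix back onto a small L0 budget by keeping only its largest-magnitude entries. The per-row budget `k` and the projection itself are illustrative assumptions, not the note's exact training recipe.

```python
import torch

@torch.no_grad()
def project_topk(weight: torch.Tensor, k: int) -> None:
    """Keep only the k largest-magnitude entries in each row of a
    weight matrix and zero the rest (in place), so each output
    neuron reads from at most k input channels."""
    fan_in = weight.size(1)
    if fan_in <= k:
        return
    # indices of the (fan_in - k) smallest-magnitude entries per row
    _, drop = weight.abs().topk(fan_in - k, dim=1, largest=False)
    weight.scatter_(1, drop, 0.0)

# usage: re-project after every optimizer step, e.g.
#   for module in model.modules():
#       if isinstance(module, torch.nn.Linear):
#           project_topk(module.weight, k=4)  # k chosen for illustration
```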

Three key findings:

  1. Disentangled task circuits. Isolating the minimal circuit for each task shows it is compact, and different tasks use different circuits with minimal overlap. This supports the hypothesis that superposition is what makes dense models hard to interpret: remove the pressure toward superposition and interpretation becomes tractable.

  2. Necessary and sufficient. Mean-ablating every neuron outside the circuit preserves task performance, while deleting only the circuit's nodes severely harms it (see the ablation sketch after this list). This is an unusually rigorous validation standard for interpretability claims.

  3. Capability-interpretability tradeoff with scaling. Making weights sparser decreases capability. Scaling model size improves the frontier — larger sparse models are more capable at the same interpretability level. But scaling beyond tens of millions of nonzero parameters while preserving interpretability remains unsolved.
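
To make finding 2 concrete, here is a minimal version of the mean-ablation test referenced above. It assumes activations at some layer are intercepted, e.g. via a PyTorch forward hook; all names are illustrative.

```python
import torch

def mean_ablate(acts: torch.Tensor, circuit_mask: torch.Tensor,
                means: torch.Tensor) -> torch.Tensor:
    """Replace activations of neurons outside the circuit with their
    dataset means; circuit neurons pass through unchanged.
    acts: [batch, n_neurons], circuit_mask: bool [n_neurons],
    means: [n_neurons], precomputed over a reference dataset."""
    return torch.where(circuit_mask, acts, means)

# Sufficiency: apply mean_ablate(acts, circuit_mask, means) in a forward
# hook and check that task performance is preserved.
# Necessity: apply mean_ablate(acts, ~circuit_mask, means) instead and
# check that task performance collapses.
```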

The critical limitation: weight-sparse models are extremely inefficient to train and deploy, and unlikely to reach frontier capabilities. This is interpretability-by-construction for research models, not a path to understanding GPT-4.

However, preliminary results suggest the method can be adapted to explain existing dense models: training sparse approximations that reveal interpretable structure already present in the dense original. If this scales, it bridges the gap between the paradigm's elegance and practical utility.
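
A minimal sketch of how such a sparse approximation might be trained. The note does not specify the objective, so the logit-distillation loss, the budget `k`, and the reuse of `project_topk` from the earlier sketch are all assumptions.

```python
import torch
import torch.nn.functional as F

def distill_step(sparse_model, dense_model, tokens, optimizer, k=4):
    """One step of fitting a weight-sparse student to imitate a frozen
    dense teacher (KL between output distributions), then re-projecting
    the student's weights to maintain the sparsity constraint."""
    with torch.no_grad():
        teacher_logits = dense_model(tokens)
    student_logits = sparse_model(tokens)
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # re-impose sparsity after the gradient step (project_topk as defined
    # in the earlier sketch)
    for module in sparse_model.modules():
        if isinstance(module, torch.nn.Linear):
            project_topk(module.weight, k)
    return loss.item()
```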


