How do language models organize features across processing layers?
Do neural networks arrange learned features into meaningful hierarchies as they process information? Answering this could reveal how models build up from raw tokens to abstract concepts.
Anthropic's circuit tracing work uses attribution graphs built from sparse autoencoders to reveal computational graphs in Claude models. The key finding is a consistent four-tier hierarchy of feature types across model layers:
Input features (early layers) — activate on specific tokens or token categories. A "digital" feature fires on "digital", "digitize", etc. These are the raw perceptual layer.
Abstract features (middle/later layers) — represent properties of context rather than surface tokens. Example: a feature for "the danger of mixing common cleaning chemicals." These are genuine conceptual representations disconnected from specific words.
Functional features (middle/later layers) — perform operations rather than represent concepts. An "add 9" feature causes the model to output a number nine greater than one appearing in its context. These are computational primitives, not representational ones.
Output features (late layers) — promote specific outputs or output categories. A "say a capital" feature promotes tokens corresponding to U.S. state capital names.
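To make the starting point of this hierarchy concrete, below is a minimal sketch of the sparse-autoencoder decomposition that attribution graphs are built on top of: activations are encoded into a much larger dictionary of sparsely active features, and each tier above is a label for what those features turn out to do. Dimensions, weights, and function names are illustrative placeholders, not Anthropic's actual implementation.

```python
import numpy as np

# Minimal SAE sketch: encode a residual-stream activation into a larger,
# sparsely active feature dictionary and reconstruct it. Real SAEs are
# trained with a reconstruction + sparsity objective so each decoder row
# becomes one interpretable feature; the random weights here are placeholders.

rng = np.random.default_rng(0)
d_model, n_features = 512, 4096            # residual width, dictionary size (illustrative)

W_enc = rng.standard_normal((n_features, d_model)) * 0.02
b_enc = np.zeros(n_features)
W_dec = rng.standard_normal((n_features, d_model)) * 0.02
b_dec = np.zeros(d_model)

def encode(x):
    """Feature activations for one activation vector x (shape: d_model)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)   # ReLU: nonnegative, sparse once trained

def decode(f):
    """Approximate reconstruction of x from feature activations f."""
    return f @ W_dec + b_dec

x = rng.standard_normal(d_model)            # stand-in for a real model activation
f = encode(x)
print(f"{int((f > 0).sum())} active features; "
      f"reconstruction error {np.linalg.norm(x - decode(f)):.2f}")
```

Attribution graphs then trace how feature activations like these influence features at later layers and ultimately the output logits, which is what exposes the input to abstract to functional to output progression.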
Polysemantic features, which activate on unrelated concepts (the word "rhythm", Michael Jordan, and so on), are concentrated in earlier layers, consistent with superposition being a compression strategy that gets resolved as processing deepens.
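A rough intuition for why superposition works as compression, sketched below: in a high-dimensional residual stream, many more feature directions than dimensions can coexist with only small pairwise interference. The sizes are arbitrary and this is a toy picture, not a measurement from any model.

```python
import numpy as np

# Toy illustration of superposition as compression: pack far more feature
# directions than dimensions. Random unit vectors in high dimensions are
# nearly orthogonal, so features interfere only weakly with one another.

rng = np.random.default_rng(0)
d_model, n_features = 256, 2048             # 8x more features than dimensions

directions = rng.standard_normal((n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

cos = directions @ directions.T             # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
print(f"worst-case interference |cos| = {np.abs(cos).max():.3f}")   # small but nonzero
```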
Critically, feature abstractions are richer in larger models (Claude 3.5 Haiku compared with the smaller 18-layer "18L" model). This suggests that scaling doesn't just add more features: it adds more abstract features, consistent with the idea that capability gains come from developing higher-level internal concepts rather than just memorizing more patterns.
Features also vary in how many layers they "live" across — some contribute to one or two layers while others have strong outputs all the way through. This gradient from local to global features challenges simple circuit-based accounts where each feature has a fixed location.
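One way to make that local-versus-global distinction measurable is sketched below, assuming you already have some per-layer activity score for each feature (for example, mean activation magnitude or decoder output norm): count the layers in which each feature is strongly active. The threshold and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np

# Sketch: how many layers does each feature "live" across?
# activity[l, i] is assumed to be a per-layer activity score for feature i
# at layer l; random data stands in for real measurements.

rng = np.random.default_rng(0)
n_layers, n_features = 18, 1000
activity = rng.exponential(scale=1.0, size=(n_layers, n_features))

threshold = np.quantile(activity, 0.9)              # "strongly active" cutoff (arbitrary)
layer_span = (activity > threshold).sum(axis=0)     # layers in which each feature is strong

n_local = int((layer_span <= 2).sum())
n_broad = int((layer_span >= n_layers // 2).sum())
print(f"local features (<=2 layers): {n_local}; "
      f"broad features (>={n_layers // 2} layers): {n_broad}")
```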
The distinction between abstract features (representing what) and functional features (computing how) is particularly important for interpretability: standard probing approaches that look for representations of concepts would entirely miss the functional features that implement the actual computation. This connects to the broader question of whether standard analysis methods hide nonlinear features in neural networks.
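To see why probing misses functional features, here is a minimal sketch of a standard linear probe on synthetic activations (names and data are hypothetical). A probe like this can only tell you whether a concept label is linearly decodable from activations; a feature whose job is to perform an operation such as "add 9" has no corresponding input label to probe for, so this kind of analysis never surfaces it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal linear-probe sketch on synthetic "activations": the probe detects
# whether a concept is represented, but says nothing about features that
# perform operations rather than encode properties of the input.

rng = np.random.default_rng(0)
d_model, n_examples = 128, 500

concept_dir = rng.standard_normal(d_model)          # hypothetical concept direction
labels = rng.integers(0, 2, size=n_examples)        # 1 = concept present in the input
acts = rng.standard_normal((n_examples, d_model)) + labels[:, None] * concept_dir

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"probe accuracy for the represented concept: {probe.score(acts, labels):.2f}")
```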
Source: MechInterp
Related concepts in this collection
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  Connection: RepE operates at the population level and would detect abstract features but may miss functional features that implement operations rather than represent concepts.
- Can sparse weight training make neural networks interpretable by design?
  Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.
  Connection: weight sparsity forces feature disentanglement by construction; circuit tracing achieves interpretability post-hoc through SAE decomposition.
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  Connection: the four-tier hierarchy provides a framework for asking where FER occurs; fracture at the abstract tier would be most damaging to generalization.
Original note title: circuit tracing reveals a four-tier feature hierarchy in language models — input features to abstract concepts to functional operations to output features