LLM Reasoning and Architecture

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Mapping that structure could reveal how models build up from raw tokens to abstract concepts.

Note · 2026-04-18 · sourced from MechInterp
What actually happens inside the minds of language models? How should researchers navigate LLM reasoning research?

Anthropic's circuit tracing work uses attribution graphs built from sparse dictionary features (cross-layer transcoders, a close relative of sparse autoencoders) to reveal computational graphs in Claude models. The key finding is a consistent four-tier hierarchy of feature types across model layers:

  1. Input features (early layers) — activate on specific tokens or token categories. A "digital" feature fires on "digital", "digitize", etc. These are the raw perceptual layer.

  2. Abstract features (middle/later layers) — represent properties of context rather than surface tokens. Example: a feature for "the danger of mixing common cleaning chemicals." These are genuine conceptual representations disconnected from specific words.

  3. Functional features (middle/later layers) — perform operations rather than represent concepts. An "add 9" feature causes the model to output a number nine greater than one appearing earlier in the context. These are computational primitives, not representational ones.

  4. Output features (late layers) — promote specific outputs or output categories. A "say a capital" feature promotes tokens corresponding to U.S. state capital names.
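As a rough illustration of how that depth gradient might be operationalized, the sketch below buckets features into tiers purely by the layer at which their activation peaks. The data is synthetic and the depth cut-offs are invented; the actual attribution-graph analysis classifies features by what they respond to and what they promote, not merely by where they sit.

```python
import numpy as np

# Toy sketch with synthetic data: bucket dictionary features by the depth at
# which their mean activation peaks. The depth cut-offs are arbitrary; real
# attribution-graph analysis inspects what a feature fires on and what it
# promotes, not just its position in the network.
rng = np.random.default_rng(0)
n_layers, n_features = 24, 1_000

# Simulated mean activation of each feature at each layer (layers x features).
activations = rng.gamma(shape=2.0, scale=1.0, size=(n_layers, n_features))

peak_layer = activations.argmax(axis=0)      # layer where each feature is strongest
depth = peak_layer / (n_layers - 1)          # normalized depth in [0, 1]

def tier(d: float) -> str:
    if d < 0.25:
        return "input (token-level detectors)"
    if d < 0.85:
        return "abstract / functional (context properties and operations)"
    return "output (promotes specific outputs)"

tiers = [tier(d) for d in depth]
for name in sorted(set(tiers)):
    print(f"{name}: {tiers.count(name)} features")
```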

Polysemantic features (single features that activate for unrelated concepts such as "rhythm" and Michael Jordan) are concentrated in earlier layers, consistent with superposition being a compression strategy that gets resolved as processing deepens.
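A minimal numerical sketch of why superposition works as compression: if many sparse features share a lower-dimensional space along random directions, a naive linear readout recovers the few active features but leaks interference onto every other feature, which is the polysemantic smearing described above. All numbers here are invented for illustration.

```python
import numpy as np

# Minimal superposition sketch (synthetic): more sparse features than
# dimensions, each assigned a random unit-norm direction. A naive linear
# readout recovers the active features but leaks interference everywhere else.
rng = np.random.default_rng(0)
n_features, d_model = 1_024, 256

W = rng.normal(size=(d_model, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)    # unit-norm direction per feature

x = np.zeros(n_features)                         # sparse feature vector
active = rng.choice(n_features, size=4, replace=False)
x[active] = 1.0

h = W @ x                                        # compressed, residual-stream-like state
readout = W.T @ h                                # naive linear readout of every feature

inactive = np.setdiff1d(np.arange(n_features), active)
print("true active features:", sorted(active.tolist()))
print("top readouts:        ", np.argsort(readout)[-4:][::-1].tolist())
print("max interference on an inactive feature:", round(readout[inactive].max(), 3))
```

The interference stays tolerable only because few features are active at once; sparse dictionary methods exploit exactly that sparsity to pull the directions back apart.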

Critically, feature abstractions are richer in larger models (Claude 3.5 Haiku versus the smaller 18-layer research model). This suggests that scaling does not just add more features; it adds more abstract features, consistent with the idea that capability gains come from developing higher-level internal concepts rather than just memorizing more patterns.

Features also vary in how many layers they "live" across — some contribute to one or two layers while others have strong outputs all the way through. This gradient from local to global features challenges simple circuit-based accounts where each feature has a fixed location.
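One way to quantify that layer span, assuming access to a (layers × features) matrix of mean feature activations, is sketched below; the threshold and data are placeholders, not values from the paper.

```python
import numpy as np

# Sketch of a layer-span measurement (synthetic data, arbitrary threshold):
# count the layers in which each feature's mean activation clears a threshold.
# Local features clear it in a layer or two; global features stay above it
# across most of the depth.
rng = np.random.default_rng(1)
n_layers, n_features = 24, 1_000

# Per-feature scale gives a spread from weak/local to strong/global features.
scales = rng.lognormal(mean=0.0, sigma=1.0, size=n_features)
activations = np.abs(rng.normal(size=(n_layers, n_features))) * scales

threshold = 1.5
layer_span = (activations > threshold).sum(axis=0)   # layers where each feature is "alive"

print("median span:", int(np.median(layer_span)))
print("local  (<= 2 layers):", int((layer_span <= 2).sum()))
print("global (>= 12 layers):", int((layer_span >= 12).sum()))
```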

The distinction between abstract features (representing what) and functional features (computing how) is particularly important for interpretability: standard probing approaches that look for representations of concepts would entirely miss the functional features that implement the actual computation. This connects to the broader question of whether standard analysis methods hide nonlinear features in neural networks.
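To make that concrete, here is a minimal linear-probe sketch on synthetic hidden states (the data, labels, and concept direction are all invented): a probe of this form can only ask whether a concept is linearly decodable from the representation, so an operation like "add 9" never enters its hypothesis space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal linear-probe sketch on synthetic hidden states. The probe asks only
# "is this concept linearly decodable here?" (a 'what' question); a functional
# feature that implements an operation such as "add 9" is never tested for,
# because the probe's hypothesis space contains only concept directions.
rng = np.random.default_rng(0)
n_samples, d_model = 2_000, 128

concept_direction = rng.normal(size=d_model)          # hypothetical concept axis
labels = rng.integers(0, 2, size=n_samples)           # is the concept present?
hidden = rng.normal(size=(n_samples, d_model)) + np.outer(labels, concept_direction)

probe = LogisticRegression(max_iter=1_000).fit(hidden, labels)
print("probe accuracy on the representational concept:", probe.score(hidden, labels))
```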


Source: MechInterp

Original note title: circuit tracing reveals a four-tier feature hierarchy in language models — input features to abstract concepts to functional operations to output features