SYNTHESIS NOTE
Model Architecture and Internals · Language, Text, and Discourse

Can dictionary learning scale to production language models?

Sparse autoencoders recovered interpretable features from toy models, but whether the method scales to production systems like Claude was an open question. This matters because interpretability at scale is foundational for AI safety work.

Synthesis note · 2026-06-03 · sourced from Evaluations

Eight months after sparse autoencoders recovered monosemantic features from a one-layer transformer, the open question was whether the method scales: if it cannot reach state-of-the-art models, it cannot contribute to safety. This work answers it: dictionary learning extracts high-quality features from Claude 3 Sonnet, a medium-sized production model. The approach rests on two hypotheses worth stating because they are load-bearing: the linear representation hypothesis (concepts are directions in activation space) and the superposition hypothesis (networks use almost-orthogonal directions to pack more features than dimensions). Sparse autoencoders are the dictionary-learning approximation that exploits both: they decompose an activation into a sparse combination of learned directions, with more directions than the activation space has dimensions.
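To make the two hypotheses concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss in NumPy. It is a generic illustration, not the configuration used in the work: the sizes, the random untrained weights, the L1 coefficient, and the names `encode`, `decode`, and `sae_loss` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: more dictionary features than activation dimensions,
# which is the overcomplete setting the superposition hypothesis motivates.
d_model, n_features = 16, 64

# Untrained, randomly initialised parameters (illustration only).
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)


def encode(x):
    """Map an activation vector to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, W_enc @ x + b_enc)


def decode(f):
    """Reconstruct the activation as a linear combination of feature directions."""
    return W_dec @ f + b_dec


def sae_loss(x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    f = encode(x)
    x_hat = decode(f)
    return np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))


x = rng.normal(size=d_model)  # stand-in for one model activation
f = encode(x)
x_hat = decode(f)
loss = sae_loss(x)
```

Each column of `W_dec` plays the role of one direction in activation space (the linear representation hypothesis), and having `n_features > d_model` is what lets the dictionary hold more features than dimensions (the superposition hypothesis). Training would minimise `sae_loss` over many activations; that step is omitted here.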

The recovered features are notable on three dimensions. They are abstract: there are features for famous people, for countries and cities, and for type signatures in code. They are multilingual and multimodal: the same feature responds to a concept across languages and in both text and images. And they span abstract and concrete instantiations of one idea (code with security vulnerabilities and abstract discussion of security vulnerabilities fire the same feature). Most consequentially, the features are not merely correlational: they both respond to a concept and causally drive the corresponding behavior, so clamping a feature steers the model.
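Clamping can be sketched as an edit in feature space followed by decoding back to activation space. The example below is an assumption-laden illustration rather than the procedure from the work: the weights are random, `clamp_feature` is a hypothetical helper, and adding back the autoencoder's reconstruction error is one common design choice, not a confirmed detail.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 8, 32  # hypothetical sizes

# Untrained, random SAE weights (illustration only).
W_enc = rng.normal(scale=0.3, size=(n_features, d_model))
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)


def encode(x):
    return np.maximum(0.0, W_enc @ x)


def decode(f):
    return W_dec @ f


def clamp_feature(x, index, value):
    """Return x with one feature's activation forced to `value`.

    The part of x that the autoencoder fails to reconstruct is added back,
    so only the clamped feature's contribution changes.
    """
    f = encode(x)
    error = x - decode(f)
    f_clamped = f.copy()
    f_clamped[index] = value
    return decode(f_clamped) + error


x = rng.normal(size=d_model)  # stand-in for one model activation
x_steered = clamp_feature(x, index=5, value=10.0)

# The edit moves x along feature 5's decoder direction and nowhere else.
shift = x_steered - x
expected = (10.0 - encode(x)[5]) * W_dec[:, 5]
```

In an actual steering experiment the modified activation would replace the original during the model's forward pass; here the sketch only shows that the intervention is a shift along a single feature direction.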

The significance for the vault is that interpretability is tractable at production scale, not just in toy models, which is a precondition for any feature-level safety or steering work. This note sits against the note "Do standard analysis methods hide nonlinear features in neural networks?": SAEs recover an impressive diversity of features, but that caution remains live, because what dictionary learning surfaces may still over-represent the linearly accessible.

Original note title

dictionary learning scales to production models recovering abstract multimodal features that both detect and causally cause behavior