How do mechanistic features compare to natural language for interpretability?

This note explores two rival ways to make a model's inner workings legible: prying open the machinery to find features and circuits (mechanistic interpretability) versus relying on plain language, whether the model's own explanations or simple text-level signals. The corpus suggests these aren't competitors so much as tools with opposite failure modes.

The mechanistic camp works from the inside out. Sparse autoencoders can now pull abstract, multilingual features out of a production model like Claude 3 Sonnet: features that both fire in response to concepts and, when nudged, causally change what the model does (see "Can dictionary learning scale to production language models?"). That causal handle is the whole point: representational analysis on its own only finds correlations, so a complete mechanistic claim requires locating a candidate feature *and* intervening to prove it matters (see "Can we understand LLM mechanisms with only representational analysis?"). Done well, this resolves models into a layered picture (concepts as directions, facts as connections, compact circuits as principled reasoning), though the tiers coexist with cruder heuristics rather than replacing them (see "Do language models understand in fundamentally different ways?").
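The pairing of a representational step with a causal step can be illustrated on a toy example. The sketch below is not a sparse autoencoder and is not taken from the cited notes: it swaps in a much simpler difference-of-means direction as the candidate feature, and every activation value, the `readout` vector, and all function names are hypothetical, chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def candidate_direction(present, absent):
    """Representational step: normalized difference of the two class means."""
    diff = [p - a for p, a in zip(mean_vec(present), mean_vec(absent))]
    return unit(diff)

def ablate(h, direction):
    """Causal step: remove the component of h that lies along the direction."""
    coeff = dot(h, direction)
    return [x - coeff * d for x, d in zip(h, direction)]

# Hypothetical toy activations: rows are hidden states with and without a concept.
present = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
absent = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]
readout = [1.0, 0.0, 0.5]  # hypothetical downstream linear readout

direction = candidate_direction(present, absent)
h = present[0]
before = dot(readout, h)                     # readout on the original state
after = dot(readout, ablate(h, direction))   # readout after the intervention
print(before, after)
```

If the readout barely moved after ablation, the direction would be a correlate rather than a cause. Here the readout changes, which is the kind of evidence the causal step is meant to supply.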

Natural language interpretability runs the other direction: it leans on the model's own fluency to scale up how much complexity a human can absorb, even opening the door to one model auditing another. But that same fluency is the trap: an explanation can feel completely plausible while being unfaithful to what the model actually did, in effect a hallucinated rationale (see "Can natural language explanations redefine what interpretability means?"). So the comparison sharpens into a trade-off: mechanistic features buy you faithfulness at the cost of effort and abstraction, while language buys you reach and readability at the cost of trust.

What you might not expect is that the language end of the spectrum sometimes wins outright on practical grounds. Lightweight, fully interpretable linguistic features hit 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while staying cheap and transparent, because models leave readable stylistic fingerprints (see "Can simple linguistic features detect AI-written arguments?"). That is surface-level signal, with no circuit-tracing required. And there's a reason surface signal carries so far: models lean heavily on textual frequency, preferring high-frequency phrasings over rarer but equivalent paraphrases, which is evidence that a lot of behavior tracks statistical mass rather than deep structure (see "Do language models really understand meaning or just surface frequency?").
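To make "lightweight, fully interpretable linguistic features" concrete, here is a minimal sketch of what such features could look like. The three features and the `MARKERS` word list are assumptions made for illustration, not the feature set used in the cited work, and no classifier or accuracy figure is implied.

```python
import re

# Hypothetical list of discourse markers; not taken from the cited study.
MARKERS = {"however", "moreover", "furthermore", "additionally"}

def linguistic_features(text):
    """Return a few human-readable surface statistics for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens or not sentences:
        return {"mean_sentence_len": 0.0, "type_token_ratio": 0.0, "marker_rate": 0.0}
    return {
        # Average number of word tokens per sentence.
        "mean_sentence_len": len(tokens) / len(sentences),
        # Distinct words divided by total words (lexical variety).
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Share of tokens that are discourse markers.
        "marker_rate": sum(t in MARKERS for t in tokens) / len(tokens),
    }

print(linguistic_features("However, this is fine. This is fine."))
```

Each value can be read and sanity-checked by a person, which is the transparency the paragraph above credits to this style of detector.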

The deeper warning is that neither feature-readability nor good metrics guarantees you understand the thing. A model can hold all the linearly decodable features a task needs while its internal organization is quietly fractured: invisible to standard evaluation, exposed only under perturbation (see "Can models be smart without organized internal structure?"). That's the argument for not picking a side. Cognitive science already faced this and offers a structured answer: Marr's three levels let you combine behavioral probes, causal interventions, and representational analysis into layered rather than monolithic explanations (see "Can cognitive science methods unlock how LLMs actually work?"). Framing the model at the computational level, as an autoregressive probability machine, even predicts *where* it will fail before you open the box at all (see "Can we predict where language models will fail?").
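The autoregressive framing can be sketched with the chain rule: a sequence's probability is the product of its next-token probabilities, so a logically simple target built from unlikely transitions still scores low. The bigram table below is entirely hypothetical (the numbers are made up for illustration), and a real model conditions on the full prefix rather than only the previous token.

```python
import math

# Hypothetical next-token probabilities P(next | previous), for illustration only.
BIGRAM = {
    ("a", "b"): 0.9, ("b", "c"): 0.9,    # forward alphabet: common transitions
    ("c", "b"): 0.05, ("b", "a"): 0.05,  # backward alphabet: rare transitions
}

def sequence_log_prob(tokens, table, floor=1e-6):
    """Chain rule: sum of log P(token_t | token_{t-1}), given the first token."""
    return sum(math.log(table.get(pair, floor)) for pair in zip(tokens, tokens[1:]))

forward = sequence_log_prob(["a", "b", "c"], BIGRAM)
backward = sequence_log_prob(["c", "b", "a"], BIGRAM)
print(forward, backward)
```

Under these assumed numbers the backward sequence has a far lower log-probability than the forward one, even though reversing three letters is trivial as a logical task.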

The thing worth taking away: the choice isn't features versus language but where in the stack you intervene. Mechanistic features verify causally but are laborious and can mislead when structure is broken; natural language and linguistic signals scale and read easily but can confabulate. The corpus keeps pointing at the same resolution — pair the inside-out and outside-in views, because each one's blind spot is the other's strength.

Sources (9 notes)

Can dictionary learning scale to production language models?

Sparse autoencoders extract high-quality, abstract, multilingual features from Claude 3 Sonnet that both respond to and causally influence model behavior. The work demonstrates interpretability is tractable at production scale, not limited to toy models.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Can natural language explanations redefine what interpretability means?

LLMs' capacity to explain in natural language expands the scale and complexity of patterns conveyable to humans, enabling ambitious new interpretability goals including model-to-model auditing. However, this medium introduces critical risks: hallucinated explanations that feel plausible but lack faithfulness.

Can simple linguistic features detect AI-written arguments?

General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.

Do language models really understand meaning or just surface frequency?

LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that models rely primarily on statistical mass from pretraining rather than on recognizing meaning.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can cognitive science methods unlock how LLMs actually work?

Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as the backwards alphabet and letter counting.
