Topics: Language Understanding and Pragmatics · LLM Reasoning and Architecture · Psychology and Social Cognition

Can high-level concepts replace circuit-level analysis in AI?

Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.

Note · 2026-02-23 · sourced from MechInterp
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The Representation Engineering (RepE) paper draws an analogy from cognitive neuroscience: the Sherringtonian view (neurons and circuits, bottom-up) vs. the Hopfieldian view (representational spaces, population-level patterns, top-down). Mechanistic interpretability is Sherringtonian — it seeks to reverse-engineer circuits from individual components. RepE is Hopfieldian — it treats high-level concepts as directions in activation space and studies them directly.

The pragmatic argument: circuit-level analysis has identified specific mechanisms (induction heads, copying circuits), but the manual effort it demands limits its scope, and strong evidence suggests models compute through iterative refinement across layers rather than through discrete circuits. ResNets tolerate the deletion of individual layers, and LLMs show similar robustness; the paper calls these findings "incompatible with a purely circuit-based account."
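The robustness claim is easy to see in miniature. Below is a toy sketch (not the cited experiments; the width, depth, and initialization are arbitrary assumptions) showing that deleting one block from a stack of residual blocks x + f(x) perturbs the output far less than deleting a layer from a plain feed-forward stack:

```python
# Toy illustration of layer-removal robustness in residual networks.
# Synthetic stand-in, not the experiments cited by the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, depth = 64, 12
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
)

def forward(x, skip=None, residual=True):
    for i, block in enumerate(blocks):
        if i == skip:
            continue  # ablate this layer entirely
        x = (x + block(x)) if residual else block(x)
    return x

x = torch.randn(8, dim)
with torch.no_grad():
    for residual in (True, False):
        full = forward(x, residual=residual)
        ablated = forward(x, skip=depth // 2, residual=residual)
        rel = ((full - ablated).norm() / full.norm()).item()
        print(f"residual={residual}: relative output change = {rel:.3f}")
```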

RepE extracts concepts through Linear Artificial Tomography (LAT) — identifying directions in activation space that correspond to truthfulness, honesty, morality, emotion, power-seeking, and other high-level properties. The reading vectors achieve 90%+ classification accuracy and generalize out-of-distribution.
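A minimal sketch of the LAT recipe on synthetic activations: in the real pipeline, `pos` and `neg` would be hidden states collected from an LLM under contrastive prompt framings (e.g. honest vs. dishonest instructions); here a concept direction is planted so the extraction runs end to end. The sign-randomized PCA step follows the published recipe, but treat the details as assumptions:

```python
# LAT-style concept reading on fabricated activations.
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 128, 200

# Planted ground-truth concept direction (unknown to the method).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Stand-ins for hidden states on contrastive prompt pairs.
pos = rng.normal(size=(n_pairs, d)) + 1.5 * true_dir   # e.g. honest framing
neg = rng.normal(size=(n_pairs, d)) - 1.5 * true_dir   # e.g. dishonest framing

# Randomize pair order so the concept shows up as the direction of
# maximum variance, then take the top principal component.
signs = rng.choice([-1.0, 1.0], size=(n_pairs, 1))
diffs = signs * (pos - neg)
diffs -= diffs.mean(axis=0)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
reading_vector = vt[0]

# Resolve the sign ambiguity using the labeled examples.
if (pos @ reading_vector).mean() < 0:
    reading_vector = -reading_vector

# Held-out classification by projection sign.
test_pos = rng.normal(size=(100, d)) + 1.5 * true_dir
test_neg = rng.normal(size=(100, d)) - 1.5 * true_dir
acc = ((test_pos @ reading_vector > 0).mean()
       + (test_neg @ reading_vector < 0).mean()) / 2
print(f"held-out accuracy: {acc:.2%}")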

The experimental framework mirrors neuroscience methodology (a code sketch of the intervention steps follows the list):

  1. Correlation — find neural correlates (what LAT does: identify directions that predict concepts)
  2. Manipulation — establish causation (adding/subtracting reading vectors changes behavior)
  3. Termination — establish necessity (removing the identified activity degrades performance)
  4. Recovery — establish sufficiency (reintroducing the activity after removal restores performance)
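A minimal sketch of steps 2-4 on a stand-in torch module: the forward-hook pattern is how such interventions are commonly attached to a transformer layer, but the layer, vector, and scale `alpha` here are synthetic assumptions. Adding the reading vector implements manipulation (and recovery, when reapplied after ablation); projecting it out implements termination:

```python
# Steering a layer's output along a reading vector via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 64
layer = nn.Linear(dim, dim)            # stand-in for a transformer layer
reading_vector = torch.randn(dim)      # stand-in for a LAT direction
reading_vector /= reading_vector.norm()

def make_hook(mode: str, alpha: float = 4.0):
    def hook(module, inputs, output):
        if mode == "add":      # manipulation / recovery: push along the concept
            return output + alpha * reading_vector
        if mode == "ablate":   # termination: remove the concept component
            coeff = output @ reading_vector
            return output - coeff.unsqueeze(-1) * reading_vector
        return output
    return hook

x = torch.randn(2, dim)
for mode in ["none", "add", "ablate"]:
    handle = layer.register_forward_hook(make_hook(mode))
    with torch.no_grad():
        out = layer(x)
    handle.remove()
    print(mode, "projection on concept:", (out @ reading_vector).tolist())
```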

The lie detection application demonstrates the power: a straightforward detector built from honesty reading vectors identifies deliberate falsehoods, hallucinations, and misleading information. The detector also flags reasoning about lying — thought processes associated with deception, not just deceptive outputs.
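Mechanically, such a detector reduces to per-token projections onto the honesty reading vector, thresholded. In the sketch below the hidden states, the planted deceptive span, and the threshold are all synthetic assumptions standing in for real model activations and a calibration set:

```python
# Projection-based lie detection over a token sequence (synthetic sketch).
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens = 128, 12
honesty_vector = rng.normal(size=d)
honesty_vector /= np.linalg.norm(honesty_vector)

hidden_states = 0.3 * rng.normal(size=(n_tokens, d))
hidden_states[5:8] -= 2.0 * honesty_vector   # plant a "deceptive" span

scores = hidden_states @ honesty_vector      # per-token honesty score
threshold = -1.0                             # tuned on labeled examples in practice
flagged = np.nonzero(scores < threshold)[0]
print("tokens flagged as dishonest:", flagged.tolist())
```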

This contrasts with SAE-based approaches: Can sparse weight training make neural networks interpretable by design? works at the circuit level (bottom-up), while RepE works at the concept level (top-down). Both are needed — circuits explain how, representations explain what.

ASC extends RepE's steerable dimensions to reasoning style. Can we steer reasoning toward brevity without retraining? demonstrates that reasoning verbosity is a linear direction in activation space, steerable via a single vector extracted from just 50 paired examples. This extends the repertoire beyond the original RepE concept directions (truthfulness, honesty, morality, emotion, power-seeking) to include a behavioral property directly relevant to inference efficiency. The training-free aspect (no fine-tuning, deployment-agnostic) validates RepE's practical case: activation-space steering is becoming a general-purpose behavioral control mechanism with a growing set of addressable dimensions.
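A sketch of this style of extraction under stated assumptions: paired activations for verbose and concise reasoning traces are simulated, the steering vector is taken as their mean difference (a simpler recipe than LAT's PCA; the actual method may differ), and a scaled copy is subtracted at inference to push generations toward brevity:

```python
# Extracting and applying a verbosity steering vector from paired examples.
import numpy as np

rng = np.random.default_rng(2)
d, n_pairs = 128, 50
verbosity_dir = rng.normal(size=d)           # planted ground truth
verbosity_dir /= np.linalg.norm(verbosity_dir)

problems = rng.normal(size=(n_pairs, d))
verbose = problems + verbosity_dir + 0.2 * rng.normal(size=(n_pairs, d))
concise = problems - verbosity_dir + 0.2 * rng.normal(size=(n_pairs, d))

# Mean difference of the 50 pairs gives the steering vector.
steering_vector = (verbose - concise).mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

# At inference, subtract a scaled copy from a hidden state;
# the scale alpha is a tunable hyperparameter.
alpha = 1.5
h = rng.normal(size=d) + verbosity_dir       # a hidden state mid-generation
h_steered = h - alpha * steering_vector
print("verbosity projection before/after:",
      float(h @ verbosity_dir), float(h_steered @ verbosity_dir))
```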


Source: MechInterp; enriched from Context Engineering

Original note title: representation engineering provides a top-down alternative to bottom-up mechanistic interpretability