Can high-level concepts replace circuit-level analysis in AI?
Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
The Representation Engineering (RepE) paper draws an analogy from cognitive neuroscience: the Sherringtonian view (neurons and circuits, bottom-up) vs. the Hopfieldian view (representational spaces, population-level patterns, top-down). Mechanistic interpretability is Sherringtonian — it seeks to reverse-engineer circuits from individual components. RepE is Hopfieldian — it treats high-level concepts as directions in activation space and studies them directly.
The pragmatic argument: circuit-level analysis has identified specific mechanisms (induction heads, copying circuits), but the manual effort it demands limits its scope, and strong evidence suggests models compute through iterative refinement across layers rather than through discrete circuits. ResNets are robust to layer removal, and similar properties appear in LLMs; such findings are "incompatible with a purely circuit-based account."
RepE extracts concepts through Linear Artificial Tomography (LAT) — identifying directions in activation space that correspond to truthfulness, honesty, morality, emotion, power-seeking, and other high-level properties. The reading vectors achieve 90%+ classification accuracy and generalize out-of-distribution.
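As a concrete illustration, here is a minimal sketch of extracting a reading vector, assuming activations for paired positive/negative stimuli have already been collected at one layer. The helper names and the PCA-on-contrasts recipe are simplifications, not the paper's exact LAT pipeline.

```python
import numpy as np

def extract_reading_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """pos_acts, neg_acts: (n_pairs, hidden_dim) activations at one layer for
    stimuli that do / do not express the target concept (e.g. honesty).
    Returns a unit-norm candidate concept direction (hypothetical helper)."""
    contrasts = pos_acts - neg_acts                 # pairwise contrasts
    contrasts = contrasts - contrasts.mean(axis=0)  # center before PCA
    _, _, vt = np.linalg.svd(contrasts, full_matrices=False)
    v = vt[0]                                       # top principal direction
    return v / np.linalg.norm(v)

def concept_score(acts: np.ndarray, reading_vector: np.ndarray) -> np.ndarray:
    """Projection onto the reading vector; thresholding this projection is the
    kind of linear classifier whose accuracy is reported above."""
    return acts @ reading_vector
```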
The experimental framework mirrors neuroscience methodology:
- Correlation — find neural correlates (what LAT does: identify directions that predict concepts)
- Manipulation — establish causation (adding/subtracting reading vectors changes behavior; see the hook sketch after this list)
- Termination — establish necessity (removing the identified activity degrades performance)
- Recovery — establish sufficiency (reintroducing the activity after removal restores performance)
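To make the manipulation and termination steps concrete, here is a hedged PyTorch sketch that adds (or projects out) a reading vector at one layer via a forward hook. The layer path, coefficient, and attribute names are placeholders that vary by model and are not taken from the paper.

```python
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Forward hook that adds coeff * direction to a layer's hidden states.
    coeff > 0 amplifies the concept, coeff < 0 suppresses it; the magnitude
    is tuned by sweeping and observing behavior (the manipulation experiment)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage (the layer path depends on the architecture):
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(v, 4.0))
# ... generate and observe the behavioral shift ...
# handle.remove()
#
# Termination can be approximated by projecting the direction out instead:
# hidden = hidden - (hidden @ v).unsqueeze(-1) * v
```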
The lie detection application demonstrates the power: a straightforward detector built from honesty reading vectors identifies deliberate falsehoods, hallucinations, and misleading information. The detector also flags reasoning about lying — thought processes associated with deception, not just deceptive outputs.
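One way such a detector could be wired up, as a rough sketch: score each generated token's hidden state by its projection onto the honesty direction and flag low-scoring spans. The layer choice, threshold, and calibration are assumptions, not the paper's exact detector.

```python
import numpy as np

def flag_dishonest_tokens(token_acts: np.ndarray, honesty_vector: np.ndarray,
                          threshold: float) -> np.ndarray:
    """token_acts: (n_tokens, hidden_dim) hidden states of generated tokens.
    Returns a boolean mask marking tokens whose projection onto the honesty
    direction falls below a calibrated threshold (assumed calibration)."""
    scores = token_acts @ honesty_vector
    return scores < threshold
```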
This contrasts with SAE-based approaches: "Can sparse weight training make neural networks interpretable by design?" works at the circuit level (bottom-up), while RepE works at the concept level (top-down). Both are needed — circuits explain how, representations explain what.
ASC extends RepE's steerable dimensions to reasoning style. "Can we steer reasoning toward brevity without retraining?" demonstrates that reasoning verbosity is a linear direction in activation space, steerable via a single vector extracted from just 50 paired examples. This extends the repertoire beyond the original RepE concept directions (truthfulness, honesty, morality, emotion, power-seeking) to include a behavioral property directly relevant to inference efficiency. The training-free aspect (no fine-tuning, deployment-agnostic) validates RepE's practical case: activation-space steering is becoming a general-purpose behavioral control mechanism with a growing set of addressable dimensions.
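As a rough illustration of how a brevity-steering vector could be derived from such a small paired set, here is a sketch assuming a simple difference-of-means construction; the original work's extraction recipe may differ.

```python
import numpy as np

def brevity_vector(verbose_acts: np.ndarray, concise_acts: np.ndarray) -> np.ndarray:
    """verbose_acts, concise_acts: (n_pairs, hidden_dim) activations for a few
    dozen paired verbose/concise reasoning traces at the chosen layer.
    The mean difference gives a direction that, added at inference time with a
    steering hook like the one above, nudges reasoning toward brevity."""
    v = concise_acts.mean(axis=0) - verbose_acts.mean(axis=0)
    return v / np.linalg.norm(v)
```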
Source: MechInterp; enriched from Context Engineering
Related concepts in this collection
- Can we measure how deeply models represent political ideology?
  This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.
  Ideological depth uses the same principle: directions in representation space correspond to concepts, and their richness indicates depth of encoding.
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  Persona vectors are an application of RepE's reading-vector approach to personality traits.
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  RepE's manipulation experiments directly test whether encoded concepts causally influence outputs, moving beyond correlation.
- Can we steer reasoning toward brevity without retraining?
  This explores whether model reasoning style occupies learnable geometric directions in activation space, and whether we can shift toward concise thinking by steering through that space without expensive retraining.
  Extends RepE's steerable dimensions to reasoning verbosity; validates practical deployment with just 50 paired examples.
- Do reasoning cycles in hidden states reveal aha moments?
  What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
  A Hopfieldian analysis of reasoning dynamics: it extracts graph-theoretic properties (cyclicity, diameter, small-world index) from hidden-state clustering rather than using linear probes, extending RepE's approach from static concept directions to dynamic reasoning process structure.
- Can we decode what LLM activations really represent in language?
  Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
  Generalizes RepE from predefined concept directions to open-ended natural language queries over activation space, enabling flexible interpretability without pre-specifying the concepts of interest.
- Can auditors discover what hidden objectives a model learned?
  Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.
  SAE interpretability used in blind audits is an application of RepE's top-down approach to safety evaluation; concept-level probing could extend auditing beyond SAE features to detect hidden objectives through reading vectors.
Original note title: representation engineering provides a top-down alternative to bottom-up mechanistic interpretability