Can steering vectors prove that representations are genuinely organized?
This explores whether causal interventions like steering vectors — nudging a model's activations in a learned direction to change behavior — actually demonstrate that its internal representations are cleanly structured, or whether they can succeed even when the underlying organization is a mess.
This reads the question as: does the ability to steer or decode a model from its activations prove the representations are genuinely organized? The corpus is unusually pointed here, and the short answer it suggests is no: being able to intervene is not the same as the structure being clean. The sharpest warning comes from work showing that a model can carry every linearly decodable feature a task needs while its internal organization is fundamentally broken (Can models be smart without organized internal structure?). A companion line on 'fractured entangled representations' makes the same point from the angle of behavior: two networks can produce identical outputs while one holds tangled, non-reusable internal structure that shatters under small weight perturbations (Can identical outputs hide broken internal representations?). So a steering vector that 'works' could be riding on a representation that is nowhere near tidy.
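As a rough illustration of the fragility point, the sketch below builds two toy linear networks that return identical outputs and then nudges a single weight in each. Everything here is an assumption made for illustration: the matrices, the `forward` and `max_output_error` helpers, and the use of an ill-conditioned factorization as a stand-in for tangled structure. It is not the construction used in the cited notes.

```python
# Toy stand-in: two 2-layer linear "networks" that both compute y = x.
# The ill-conditioned factorization is an assumed proxy for tangled structure.

def matmul(a, b):
    """Plain matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def forward(w2, w1, x):
    """Apply the first layer w1, then the second layer w2, to vector x."""
    column = [[value] for value in x]
    return [row[0] for row in matmul(w2, matmul(w1, column))]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_output_error(w2, w1, x, eps):
    """Nudge one first-layer weight by eps; return the largest output change."""
    perturbed = [row[:] for row in w1]
    perturbed[0][1] += eps
    before = forward(w2, w1, x)
    after = forward(w2, perturbed, x)
    return max(abs(p - q) for p, q in zip(before, after))

identity = [[1.0, 0.0], [0.0, 1.0]]
mixing = [[1.0, 1.0], [1.0, 1.01]]        # nearly parallel rows (hypothetical)

tidy = (identity, identity)                # (second layer, first layer)
tangled = (inverse_2x2(mixing), mixing)    # second layer undoes the mixing

x = [1.0, 1.0]
tidy_out = forward(*tidy, x)
tangled_out = forward(*tangled, x)
tidy_err = max_output_error(*tidy, x, eps=0.01)
tangled_err = max_output_error(*tangled, x, eps=0.01)

print("outputs:", tidy_out, tangled_out)
print("error after the same weight nudge:", tidy_err, tangled_err)
```

Under these assumptions both networks map the input to itself, yet the same 0.01 weight nudge moves the tangled network's output roughly a hundred times further, which is the kind of difference an output-only evaluation would not see.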
The reason this matters is methodological, and one note states it almost as a rule: representational analysis alone finds correlations without causation, and causal analysis alone shows effects without explaining them. Only the two paired together (locate a candidate feature representationally, then verify it causally) produce a real mechanistic claim (Can we understand LLM mechanisms with only representational analysis?). A steering vector is the causal half. It can confirm that a direction *does something*, but on its own it can't tell you whether that direction corresponds to an organized, isolated concept or to an entangled shortcut that happens to move behavior.
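The locate-then-verify pairing can be sketched on a toy two-feature activation space. The sketch assumes a deliberately simple setup that is not taken from the cited notes: features are written into activations along fixed directions, behavior is read out along those same directions, and the steering vector is a difference of class means. All names (`encode`, `steering_report`) and numbers are hypothetical.

```python
# Toy illustration only: a 2-D activation space holding two features, A and B.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def add(u, v, scale=1.0):
    return [x + scale * y for x, y in zip(u, v)]

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def encode(a, b, dir_a, dir_b):
    """Write feature values a and b into activations along their directions."""
    return add([a * x for x in dir_a], dir_b, scale=b)

def steering_report(dir_a, dir_b):
    """Return (on-target shift, off-target shift) when steering feature A."""
    inputs = [(a, b) for a in (0, 1) for b in (0, 1)]
    acts = {ab: encode(*ab, dir_a, dir_b) for ab in inputs}

    # Representational step: locate a candidate direction for A
    # as the difference between class means (A on vs. A off).
    mean_on = mean([h for (a, _), h in acts.items() if a == 1])
    mean_off = mean([h for (a, _), h in acts.items() if a == 0])
    steer = [p - q for p, q in zip(mean_on, mean_off)]

    # Causal step: add the direction to a baseline activation and
    # measure how much each behavior's readout moves.
    base = acts[(0, 0)]
    steered = add(base, steer)
    on_target = dot(dir_a, steered) - dot(dir_a, base)
    off_target = dot(dir_b, steered) - dot(dir_b, base)
    return on_target, off_target

# Orthogonal feature directions versus overlapping ones (both assumed).
clean = steering_report([1.0, 0.0], [0.0, 1.0])
entangled = steering_report([1.0, 0.0], [0.6, 0.8])

print("clean     (on-target, off-target):", clean)
print("entangled (on-target, off-target):", entangled)
```

In this toy, steering shifts the target behavior by about 1 in both cases, but in the entangled case it also shifts the other behavior by about 0.6. The on-target effect alone cannot distinguish the two; the off-target measurement does.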
What does count as evidence of genuine organization? The corpus points to structure you can see in the geometry itself, independent of any single intervention. LLMs encode syntactic relations in a polar-coordinate scheme, using both distance and angle to mark the type and direction of a relation (How do language models encode syntactic relations geometrically?). The leading eigenvectors of embedding similarity matrices split a taxonomy coarse-to-fine, tracking the WordNet hypernym tree level by level (Do embedding eigenvectors organize taxonomy from coarse to fine?). And pruning experiments show that networks really do isolate compositional subroutines into modular subnetworks whose ablation affects only their own function (Do neural networks naturally learn modular compositional structure?). That last result is the model for what a steering vector *should* aspire to: the strongest version of a causal claim is a component shown by ablation to be both necessary and sufficient for its function, as in the weight-sparsity work, where interpretable circuits are verified this way (Can sparse weight training make neural networks interpretable by design?).
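The coarse-to-fine spectral claim can be illustrated with a hand-built example. The sketch below assumes four hypothetical embeddings for a two-level taxonomy, forms their Gram matrix, and extracts the leading eigenvectors by power iteration with deflation (a simple stand-in chosen to avoid dependencies, not the method of the cited work). The embeddings are constructed so that the hierarchy is present by design; the point is only to show what the spectral ordering looks like.

```python
import math

# Hypothetical 3-D embeddings for a two-level toy taxonomy.
# Axis 0 carries the coarse branch (animal vs. plant);
# axes 1 and 2 carry the finer split inside each branch.
names = ["dog", "cat", "oak", "rose"]
X = [
    [1.0, 0.5, 0.0],
    [1.0, -0.5, 0.0],
    [-1.0, 0.0, 0.4],
    [-1.0, 0.0, -0.4],
]

def gram(rows):
    """Matrix of pairwise dot products between embeddings."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpair(m, iters=500):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    v = [float(i + 1) for i in range(len(m))]  # fixed, arbitrary start vector
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return eigenvalue, v

def leading_eigenpairs(m, k):
    """Extract k eigenpairs in decreasing order by repeated deflation."""
    m = [row[:] for row in m]
    found = []
    for _ in range(k):
        lam, v = top_eigenpair(m)
        found.append((lam, v))
        # Deflation: subtract the component just found.
        n = len(v)
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return found

pairs = leading_eigenpairs(gram(X), 3)
for lam, v in pairs:
    print(round(lam, 3), {n: round(c, 3) for n, c in zip(names, v)})
```

With these inputs the largest eigenvalue belongs to a vector that separates animals from plants, and the next two separate dog from cat and oak from rose, in that order.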
The takeaway: 'organization' is not one property, and steering tests only one face of it. A direction can be causally potent and still entangled; structure can be geometrically real and still fragile. The honest test for whether a representation is genuinely organized is convergent: geometry that mirrors a known structure, modules that ablate cleanly, *and* interventions that move behavior in a feature-specific way. Steering vectors are a necessary probe, not a sufficient proof, and the corpus's recurring lesson is that any single lens, decoding or steering included, can pass while the representation underneath stays fractured.
Sources (7 notes)
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.