Can steering vectors prove that representations are genuinely organized?

This explores whether causal interventions like steering vectors — nudging a model's activations in a learned direction to change behavior — actually demonstrate that its internal representations are cleanly structured, or whether they can succeed even when the underlying organization is a mess.

The question, as read here: does the ability to steer or decode a model from its activations prove that its representations are genuinely organized? The corpus is unusually pointed on this, and the short answer it suggests is no: being able to intervene is not the same as the structure being clean. The sharpest warning comes from work showing that a model can carry every linearly decodable feature a task needs while its internal organization is fundamentally broken (see: Can models be smart without organized internal structure?). A companion line on 'fractured entangled representations' makes the same point from the angle of behavior: two networks can produce identical outputs while one holds tangled, non-reusable internal structure that shatters under small weight perturbations (see: Can identical outputs hide broken internal representations?). So a steering vector that 'works' could be riding on a representation that is nowhere near tidy.

This matters for methodological reasons, and one note states it almost as a rule: representational analysis alone finds correlations without causation, and causal analysis alone shows effects without explaining them. Only the two paired together (locate a candidate feature representationally, then verify it causally) produce a real mechanistic claim (see: Can we understand LLM mechanisms with only representational analysis?). A steering vector is the causal half. It can confirm that a direction *does something*, but on its own it can't tell you whether that direction corresponds to an organized, isolated concept or to an entangled shortcut that happens to move behavior.
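To make the "causal half" concrete, here is a minimal sketch of a difference-of-means steering vector on toy 2-D activations. Everything in it is an assumption made for illustration: the activations are hand-written lists, `readout` is a hypothetical linear stand-in for downstream behavior, and the "clean" and "tangled" cases are invented to contrast an isolated direction with an entangled one. It is not any paper's method.

```python
# Toy illustration only: activations are 2-D lists, not real model states.

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(pos_acts, neg_acts):
    """Difference-of-means direction between two sets of activations."""
    mp, mn = mean(pos_acts), mean(neg_acts)
    return [p - q for p, q in zip(mp, mn)]

def steer(act, direction, alpha=1.0):
    """Add alpha * direction to an activation."""
    return [a + alpha * d for a, d in zip(act, direction)]

def readout(act, weights):
    """Hypothetical linear readout standing in for a downstream behavior."""
    return sum(a * w for a, w in zip(act, weights))

# The target behavior reads dim 0; an unrelated behavior reads dim 1.
TARGET_W = [1.0, 0.0]
OTHER_W = [0.0, 1.0]

# Case "clean": the concept varies only along dim 0.
clean_pos = [[1.0, 0.2], [1.2, -0.2]]
clean_neg = [[-1.0, 0.2], [-1.2, -0.2]]

# Case "tangled": the concept co-varies with dim 1.
tangled_pos = [[1.0, 1.0], [1.2, 1.2]]
tangled_neg = [[-1.0, -1.0], [-1.2, -1.2]]

base = [0.0, 0.0]
results = {}
for name, pos, neg in [("clean", clean_pos, clean_neg),
                       ("tangled", tangled_pos, tangled_neg)]:
    v = steering_vector(pos, neg)
    steered = steer(base, v)
    # (effect on target behavior, effect on unrelated behavior)
    results[name] = (readout(steered, TARGET_W), readout(steered, OTHER_W))

for name, (target, other) in results.items():
    print(name, "target:", round(target, 3), "off-target:", round(other, 3))
```

In both cases the target readout moves by the same amount, so judged by the target behavior alone the two steering vectors look equally successful. Only the off-target readout separates the isolated direction from the entangled one, which is the sense in which steering needs a second lens.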

What does count as evidence of genuine organization? The corpus points to structure you can see in the geometry itself, independent of any single intervention. LLMs encode syntactic relations in a polar-coordinate scheme, using both distance and angle to mark the type and direction of a relation (see: How do language models encode syntactic relations geometrically?). The leading eigenvectors of embedding similarity (Gram) matrices split a taxonomy coarse-to-fine, tracking the WordNet hypernym tree level by level (see: Do embedding eigenvectors organize taxonomy from coarse to fine?). And pruning experiments show that networks do isolate compositional subroutines into modular subnetworks whose ablation affects only their own function (see: Do neural networks naturally learn modular compositional structure?). That last one is the model for what a steering vector *should* aspire to: the strongest version of a causal claim is a circuit shown by ablation to be both necessary and sufficient, as in the weight-sparsity work where interpretable circuits are verified this way (see: Can sparse weight training make neural networks interpretable by design?).
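The coarse-to-fine spectral claim can be illustrated with a small sketch. The four "embeddings" below are invented toy vectors (two animals, two plants), not real model embeddings. Power iteration with deflation is used only to keep the example dependency-free; it is an assumption of this sketch, not necessarily how the cited work computes eigenvectors.

```python
import math

def gram(X):
    """Gram (inner-product similarity) matrix of row vectors X."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def power_iteration(M, start, steps=100):
    """Approximate the dominant eigenpair of a symmetric PSD matrix."""
    v = normalize(start)
    for _ in range(steps):
        v = normalize(matvec(M, v))
    lam = sum(a * b for a, b in zip(v, matvec(M, v)))  # Rayleigh quotient
    return lam, v

def deflate(M, lam, v):
    """Remove a found eigenpair so the next one becomes dominant."""
    n = len(v)
    return [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]

# Hypothetical, already-centered toy embeddings: dim 0 carries the coarse
# branch (animal vs plant), dims 1 and 2 carry the finer distinctions.
names = ["dog", "cat", "oak", "pine"]
X = [[1.0, 0.3, 0.0],
     [1.0, -0.3, 0.0],
     [-1.0, 0.0, 0.3],
     [-1.0, 0.0, -0.3]]

G = gram(X)
start = [1.0, 0.5, 0.25, 0.125]
lam1, v1 = power_iteration(G, start)
lam2, v2 = power_iteration(deflate(G, lam1, v1), start)

print("leading:", round(lam1, 3), dict(zip(names, [round(x, 2) for x in v1])))
print("second: ", round(lam2, 3), dict(zip(names, [round(x, 2) for x in v2])))
```

In this toy setup the leading eigenvector gives dog and cat one sign and oak and pine the other, so it captures the coarse branch, while the much smaller second eigenvalue belongs to a direction that separates items within a branch. Real embeddings will be noisier, and the toy data's second eigenvalue is degenerate, so the exact within-branch direction depends on the starting vector.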

The quiet thing worth taking away: 'organization' is not one property, and steering tests only one face of it. A direction can be causally potent and still entangled; structure can be geometrically real and still fragile. The honest test for whether a representation is genuinely organized is convergent — geometry that mirrors a known structure, modules that ablate cleanly, *and* interventions that move behavior in a feature-specific way. Steering vectors are a necessary probe, not a sufficient proof, and the corpus's recurring lesson is that any single lens, decoding or steering included, can pass while the representation underneath stays fractured.
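As a rough sketch of what "modules that ablate cleanly" means operationally, the toy below checks necessity, sufficiency, and specificity for one module. The two-module `run` function, the task names, and the boolean mask are all hypothetical stand-ins; a real study would ablate weights or activations in a trained network.

```python
def run(x, active):
    """Toy network: two hypothetical modules, each gated by a mask entry."""
    double = 2 * x if active["double"] else 0   # module serving task 'double'
    square = x * x if active["square"] else 0   # module serving task 'square'
    return {"double": double, "square": square}

TARGETS = {"double": lambda x: 2 * x, "square": lambda x: x * x}
INPUTS = [1, 2, 3]  # nonzero, so an ablated module cannot be right by accident

def accuracy(task, active):
    """Fraction of inputs on which the masked network solves the task."""
    hits = sum(run(x, active)[task] == TARGETS[task](x) for x in INPUTS)
    return hits / len(INPUTS)

ablated = {"double": False, "square": True}   # remove only the 'double' module
only = {"double": True, "square": False}      # keep only the 'double' module

necessary = accuracy("double", ablated) == 0.0   # task fails without the module
specific = accuracy("square", ablated) == 1.0    # the other task is untouched
sufficient = accuracy("double", only) == 1.0     # the module alone solves it

print("necessary:", necessary, "specific:", specific, "sufficient:", sufficient)
```

All three checks pass here by construction. The point is the shape of the test: a steering result supplies at most one of these three answers, whereas a clean ablation study reports all of them.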

Sources 7 notes

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-evaluating whether steering vectors prove genuine representational organization in LLMs. The question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat these as perishable claims:
• A model can achieve identical task performance while holding fundamentally fractured, non-reusable internal structure that shatters under weight perturbation (2025).
• Steering vectors show causality (moving behavior) but cannot alone distinguish organized, isolated concepts from entangled shortcuts (2024–2025).
• Genuine organization requires *convergent* evidence: geometric structure mirroring known taxonomies (e.g., WordNet hierarchies emerging in embedding eigenvectors), clean ablations (modules whose removal affects only their function), *and* feature-specific interventions together (2023–2025).
• LLMs encode syntax using polar coordinates (distance + angle marking relation type and direction), a sign of organized geometric structure independent of steering (2024–2025).
• Weight sparsity isolates interpretable, disentangled circuits verifiable by ablation (necessity + sufficiency), the gold standard for causal claims (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.18654 (2023) — Faith and Fate: compositionality limits
• arXiv:2505.11581 (2025) — Fractured Entangled Representations
• arXiv:2511.13653 (2025) — Weight-sparse transformers have interpretable circuits
• arXiv:2605.23821 (2026) — Hierarchical Concept Geometry

Your task:
(1) RE-TEST EACH CONSTRAINT. Has progress in sparse autoencoders, circuit discovery, or multi-scale analysis since then narrowed the gap between steering causality and representational cleanliness? Separate the durable question (what counts as genuine organization?) from perishable claims (steering alone suffices). Cite what moved the needle.
(2) Surface the strongest *disagreement* in the last ~6 months: which recent work challenges the "convergent evidence" bar or argues steering *is* sufficient under specified conditions?
(3) Propose 2 research questions assuming the regime may have shifted—e.g., do scale or training objective reshape the organization–steering relationship?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
