Can representation engineering cleanly isolate single features in entangled semantic space?

This note asks whether we can cleanly pull out one concept at a time from a model's internal representations, or whether meaning is so braided together inside the network that surgical isolation is a fantasy. The corpus suggests the honest answer is: sometimes, but only when you engineer for it up front, and the default state of a trained network works against you.

The pessimistic evidence comes from how representations actually form. A model can carry every linearly decodable feature it needs and still be internally fractured: two networks with identical accuracy can differ fundamentally in internal organization, and a broken one is exposed only under perturbation or distribution shift (Can models be smart without organized internal structure?). That is the entanglement problem in a nutshell: clean behavior on the outside tells you nothing about whether features are cleanly separated on the inside. Worse, traits and behaviors can ride along in statistical signatures that have no semantic relationship to the thing being transmitted, surviving aggressive filtering precisely because they are not localized to anything human-legible (Can language models transmit hidden behavioral traits through unrelated data?). If a behavior can hide in data that looks unrelated, it can plausibly hide in activation space the same way.

The optimistic counter-thread says isolation is achievable, but you usually have to force it during training rather than recover it afterward. Training transformers with sparse weights produces compact, human-interpretable circuits where individual neurons map to simple concepts, and ablation confirms those circuits are both necessary and sufficient for the task (Can sparse weight training make neural networks interpretable by design?). That is disentanglement by construction, the cleanest case the corpus offers, with the catch that it has only been shown to hold up to tens of millions of parameters; scaling beyond that while keeping the circuits interpretable remains unsolved. So the cleanliness you want may be inversely related to the scale you care about.
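To make "necessary and sufficient" concrete, here is a minimal sketch in Python with NumPy. It is not the sparse-weight training method from the cited note: the network is hand-wired rather than trained, and the names `forward`, `accuracy`, `circuit`, and all weights are illustrative assumptions. Two hidden units compute the task, four distractor units add small noise, and zero-ablation tests each claim separately.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hidden = 1000, 8, 6

x = rng.normal(size=(n, d))
labels = x[:, 0] > 0                 # toy task: is the first input positive?

# Hand-wired weights (an assumption for illustration, not a trained model).
W = rng.normal(size=(hidden, d))
W[:, 0] = 0.0                        # distractor units never read input 0
W[0] = 0.0
W[0, 0] = 1.0                        # unit 0 computes relu(x0)
W[1] = 0.0
W[1, 0] = -1.0                       # unit 1 computes relu(-x0)
w_out = 0.01 * rng.normal(size=hidden)   # distractors get tiny output weights
w_out[0], w_out[1] = 1.0, -1.0       # circuit output: relu(x0) - relu(-x0) = x0

circuit = np.array([0, 1])
others = np.setdiff1d(np.arange(hidden), circuit)

def forward(x, keep=None):
    """Run the network, zero-ablating every hidden unit not listed in `keep`."""
    h = np.maximum(x @ W.T, 0.0)
    if keep is not None:
        mask = np.zeros(hidden)
        mask[keep] = 1.0
        h = h * mask
    return h @ w_out

def accuracy(outputs):
    return float(np.mean((outputs > 0) == labels))

acc_full = accuracy(forward(x))
acc_circuit_only = accuracy(forward(x, keep=circuit))     # sufficiency check
acc_circuit_ablated = accuracy(forward(x, keep=others))   # necessity check
print(acc_full, acc_circuit_only, acc_circuit_ablated)
```

In this toy, keeping only the circuit preserves the task and removing it drops accuracy to roughly chance. The result is clean only because the circuit was wired to be separable in the first place, which is the point the paragraph above makes about disentanglement by construction.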

There's also a deeper reason features resist isolation: much of what models encode is geometric and relational, not a list of separable switches. Syntactic relations live in a polar coordinate system, with type and direction encoded jointly through distance and angle, so a probe that reads distance alone reaches only about half the accuracy of one that reads both (How do language models encode syntactic relations geometrically?). Embedding structure splits taxonomy coarse-to-fine along leading eigenvectors, so a single 'feature' is often a band in a spectrum rather than a discrete unit (Do embedding eigenvectors organize taxonomy from coarse to fine?). When meaning is carried by relative position in a continuous space, 'one feature' may not even be the right unit to isolate.
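The coarse-to-fine spectral ordering can be seen in a small hand-built case. The sketch below, in Python with NumPy, uses four hypothetical embeddings for a two-level tree (branches A and B, two leaves each); the matrix `X` and the helper `gram_spectrum` are assumptions for illustration, not data or code from the cited note. The branch axis is given the largest scale, so the leading eigenvector of the Gram matrix splits the branches and later eigenvectors split leaves within a branch.

```python
import numpy as np

# Hypothetical toy embeddings for four leaves of a two-level tree.
# Column 0 encodes the branch, columns 1 and 2 encode the leaf within a branch.
X = np.array([
    [ 2.0,  1.0,  0.0],   # A1
    [ 2.0, -1.0,  0.0],   # A2
    [-2.0,  0.0,  0.5],   # B1
    [-2.0,  0.0, -0.5],   # B2
])

def gram_spectrum(X):
    """Eigen-decompose the Gram matrix X X^T, largest eigenvalue first."""
    G = X @ X.T
    eigvals, eigvecs = np.linalg.eigh(G)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

eigvals, eigvecs = gram_spectrum(X)
coarse = eigvecs[:, 0]   # separates branch A from branch B (up to overall sign)
fine_a = eigvecs[:, 1]   # separates A1 from A2, near zero on B1 and B2
fine_b = eigvecs[:, 2]   # separates B1 from B2, near zero on A1 and A2

print(np.round(eigvals, 3))
print(np.round(coarse, 3), np.round(fine_a, 3), np.round(fine_b, 3))
```

Because the columns of `X` are orthogonal, the eigenvalues are just the squared column norms (16, 2, 0.5, plus a zero), so the ordering here is set by the scales chosen above. Real embeddings are not constructed this way; the toy only shows why a taxonomic 'feature' can correspond to a position in the spectrum rather than to a single unit.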

The useful turn for a curious reader: representation engineering's real payoff may be less about clean extraction and more about effective intervention. When a model ignores its context because strong training-time priors dominate, prompting alone can't fix it, but a causal intervention directly in the representations can (Why do language models ignore information in their context?). You may never get a perfectly isolated 'feature,' yet still be able to reach in and steer the entangled bundle. Isolation and control are different goals, and the corpus suggests the second is the more reachable one.
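One common way to intervene in representations is to add a direction to a hidden state. The sketch below, in Python with NumPy, uses a difference-of-means steering vector on synthetic activations. This is a generic technique chosen for illustration, not necessarily the intervention used in the cited note, and `true_dir`, `steering_vector`, `steer`, and the data are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical setup: the concept truly lives along axis 0 of a 16-d hidden state.
true_dir = np.zeros(d)
true_dir[0] = 1.0
pos = rng.normal(size=(200, d)) + 2.0 * true_dir   # states with the concept
neg = rng.normal(size=(200, d)) - 2.0 * true_dir   # states without it

def steering_vector(pos, neg):
    """Unit-norm difference of class means."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h, v, alpha):
    """Shift a hidden state along v by strength alpha."""
    return h + alpha * v

v = steering_vector(pos, neg)
h = neg[0]
steered = steer(h, v, alpha=8.0)

# How far the intervention moved the state along the true concept axis.
shift_on_concept = float((steered - h) @ true_dir)
# Norm of the part of v that points away from the true axis.
off_target = float(np.linalg.norm(v - (v @ true_dir) * true_dir))
print(shift_on_concept, off_target)
```

The estimated direction moves the state strongly along the concept axis while also carrying a small off-axis component, so the intervention works even though the direction is not a perfectly isolated feature. That is the distinction between control and isolation drawn above.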

Sources 6 notes

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
