INQUIRING LINE

What other behavioral properties exist as linear directions in activation space?

This explores what behaviors and properties — beyond personality traits — show up as steerable linear directions you can find, monitor, and push on inside a model's activations.


This reads the question as: the linear representation idea works for persona traits, so what else does the corpus find living as a direction in activation space? The answer is more than you'd expect, and the boundaries of the idea are as interesting as the hits.

The clearest sibling to personality is reasoning style. Just as sycophancy and hallucination turn out to be 'persona vectors' that you can read off and steer during finetuning (Can we track and steer personality shifts during model finetuning?), chain-of-thought verbosity behaves like a single direction: apply one vector extracted from 50 paired examples and reasoning length drops by about two-thirds (67%) without losing accuracy (Can we steer reasoning toward brevity without retraining?). So a stylistic property, how much the model rambles, is geometrically separable from what it actually concludes.
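To make "one vector extracted from paired examples" concrete, here is a minimal sketch. It assumes the vector is a difference of means between activations on concise and verbose versions of the same prompts, which is a common way to build steering vectors but is not confirmed by the notes as the exact procedure. The activations are synthetic, and the names `steering_vector`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np

def steering_vector(verbose_acts, concise_acts):
    """Difference of means over paired activations, each shaped (n_pairs, d_model)."""
    verbose_acts = np.asarray(verbose_acts, dtype=float)
    concise_acts = np.asarray(concise_acts, dtype=float)
    return concise_acts.mean(axis=0) - verbose_acts.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add alpha * vector to a hidden state (alpha is an illustrative strength knob)."""
    return np.asarray(hidden, dtype=float) + alpha * vector

def projection(hidden, vector):
    """Scalar readout of a hidden state along the unit-normalized vector."""
    return float(np.asarray(hidden, dtype=float) @ (vector / np.linalg.norm(vector)))

# Synthetic stand-in for real activations: "concise" differs from "verbose"
# along one hidden axis, plus a little noise.
rng = np.random.default_rng(0)
d_model, n_pairs = 16, 50
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
verbose = rng.normal(size=(n_pairs, d_model))
concise = verbose + 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, d_model))

v = steering_vector(verbose, concise)
h = rng.normal(size=d_model)
h_steered = steer(h, v, alpha=1.0)
```

Reading `projection(h, v)` is the monitoring half of the story and calling `steer` is the intervention half. In a real model the same two operations would run on residual-stream activations at a chosen layer.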

But the corpus also pushes back on the assumption that everything is a clean straight line. Syntax isn't stored as a simple direction at all: models encode grammatical relations in polar coordinates, using both distance and angle, which nearly doubles probing accuracy over distance-only methods (How do language models encode syntactic relations geometrically?). And meaning itself is tangled: 28 semantic axes collapse into three principal components that match the human EPA (evaluation, potency, activity) structure, so intervening on one feature drags its aligned neighbors along in proportion (Do LLM semantic features organize along human evaluation dimensions?). That's the catch with steering: directions are real, but they're rarely independent, so a clean nudge on one trait produces off-target shifts on others.
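The off-target effect follows from simple geometry, which a small sketch can show. Assuming each trait is read out by projecting onto a unit direction, pushing a hidden state by `alpha` along one direction changes the readout on any other direction by `alpha` times their cosine similarity. The three toy directions below are invented for illustration and do not come from the cited notes.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def off_target_shift(target_dir, other_dir, alpha):
    """Change in the readout along other_dir after adding alpha * unit(target_dir)."""
    return alpha * float(unit(target_dir) @ unit(other_dir))

# Hypothetical trait directions in a 3-dimensional toy space.
trait_a = np.array([1.0, 0.0, 0.0])
trait_b = np.array([0.6, 0.8, 0.0])  # cosine similarity 0.6 with trait_a
trait_c = np.array([0.0, 0.0, 1.0])  # orthogonal to trait_a

h = np.array([0.2, -0.1, 0.5])
h_steered = h + 2.0 * unit(trait_a)  # steer trait_a with strength 2.0

shift_on_b = float((h_steered - h) @ unit(trait_b))  # 2.0 * 0.6 = 1.2
shift_on_c = float((h_steered - h) @ unit(trait_c))  # 0.0, since c is orthogonal to a
```

Only an orthogonal direction is left untouched. When many features load on a few shared components, as in the 28-axes result, few pairs are orthogonal, so some spillover is the expected case.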

The sharpest warning comes from work showing that linear decodability can be a mirage. A model can contain every linearly readable feature a task needs while its actual internal organization is fractured and brittle: perfect on the metric, fragile under perturbation (Can models be smart without organized internal structure?). The flip side is that linear decodability of a task's building blocks reliably predicts whether a model will compositionally generalize (Can neural networks learn compositional skills without symbolic mechanisms?). So the presence of a direction tells you something, but whether it means competence or coincidence depends on the rest of the geometry.
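"Linearly decodable" has an operational meaning: a linear probe trained on activations can recover the feature on held-out examples. The sketch below fits a least-squares probe on synthetic activations. The data, the probe design, and the names `fit_linear_probe` and `probe_accuracy` are assumptions for illustration, not the setup used in the cited notes.

```python
import numpy as np

def fit_linear_probe(acts, labels):
    """Least-squares linear probe with a bias term; labels are 0 or 1."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    targets = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w

def probe_accuracy(w, acts, labels):
    """Fraction of examples where the sign of the probe output matches the label."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    preds = (X @ w > 0).astype(int)
    return float((preds == labels).mean())

# Synthetic activations in which a binary feature is written along one axis.
rng = np.random.default_rng(0)
n, d = 400, 8
labels = rng.integers(0, 2, size=n)
feature_dir = np.zeros(d)
feature_dir[2] = 1.0
acts = rng.normal(size=(n, d)) + 4.0 * np.outer(2 * labels - 1, feature_dir)

w = fit_linear_probe(acts[:300], labels[:300])        # train on the first 300
acc = probe_accuracy(w, acts[300:], labels[300:])     # evaluate on the held-out 100
```

High held-out accuracy shows only that the feature can be read out linearly. As the warning above says, it does not show that the model's own computation is organized around that direction, so a probe result is best paired with perturbation tests.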

The thing you might not have known you wanted: not everything that's encoded is a fixed direction. Some properties show up as changes in density rather than location. Models sparsify their activations adaptively when they hit unfamiliar, out-of-distribution inputs (Do language models sparsify their activations under difficult tasks?), and that sparsity is itself learned from how familiar the training data was (Is representational sparsity learned or intrinsic to neural networks?). 'How confident / how familiar' isn't a vector you steer; it's a property of how spread out the representation is. The linear-direction story is one powerful lens, but the corpus suggests behavior is also written in geometry the straight-line picture misses.
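A density property like this is measured with a scalar statistic over the whole activation vector rather than a projection. As one possible sketch, the code below uses Hoyer's sparsity measure, which is 0 when all entries have equal magnitude and 1 when a single entry is nonzero. The choice of metric is an assumption; the cited notes may quantify sparsity differently, and the activations here are synthetic.

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer sparsity in [0, 1]: 0 for equal-magnitude entries, 1 for a one-hot vector."""
    x = np.abs(np.asarray(x, dtype=float)).ravel()
    n = x.size
    l2 = np.sqrt((x ** 2).sum())
    if n < 2 or l2 == 0.0:
        raise ValueError("need at least two entries and a nonzero vector")
    return float((np.sqrt(n) - x.sum() / l2) / (np.sqrt(n) - 1.0))

rng = np.random.default_rng(0)
dense_acts = rng.normal(size=64)   # stand-in for a hidden state on familiar input
sparse_acts = dense_acts.copy()
sparse_acts[8:] = 0.0              # stand-in for unfamiliar input: few active units

dense_score = hoyer_sparsity(dense_acts)
sparse_score = hoyer_sparsity(sparse_acts)
```

Tracking a score like this across inputs gives a monitor for familiarity that has no steering counterpart: there is no single vector to add that makes a representation denser.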


Sources (8 notes)

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do LLM semantic features organize along human evaluation dimensions?

Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
