Can spectral eigenvector ordering serve as a model-agnostic interpretability probe?

This explores whether the order of an embedding matrix's eigenvectors — its spectral structure — could work as a general-purpose, architecture-independent tool for reading what a model has learned, and what the corpus says about the promise and limits of that idea.

Spectral eigenvector ordering means sorting the eigenvectors of an embedding's covariance or Gram matrix by eigenvalue and reading meaning off their sequence. On whether that can serve as a model-agnostic interpretability probe, the corpus offers one strong piece of encouragement and several sharp cautions. The encouraging result: the leading eigenvectors of embedding Gram matrices split a concept taxonomy coarse-to-fine, peeling off broad branches first and finer sub-branches later, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). That is exactly what a spectral probe hopes for: the ordering carries semantic information, and it does so in a way predicted from co-occurrence statistics rather than hand-tuned per model, which is what would make the signal model-agnostic.
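
To make the coarse-to-fine claim concrete, here is a minimal sketch on a toy hierarchy, not a reproduction of the cited result. It assumes eight leaves of a balanced binary tree, embedded with hypothetical +1/-1 branch-membership features whose scale shrinks with depth. The function names, the scales, and the tree itself are illustrative assumptions. Under those assumptions, the eigenvectors of the Gram matrix come out ordered by tree depth.

```python
import numpy as np

def toy_tree_embeddings(scales=(3.0, 2.0, 1.0)):
    """Embed 8 leaves of a balanced binary tree (hypothetical toy data).

    Each tree level contributes +/-scale features marking which child branch
    a leaf falls under; coarser levels are assumed to get larger scales.
    """
    n_leaves = 8
    columns = []
    for level, scale in enumerate(scales):
        block = n_leaves // (2 ** level)   # leaves under each node at this level
        half = block // 2
        for start in range(0, n_leaves, block):
            col = np.zeros(n_leaves)
            col[start:start + half] = scale            # left child
            col[start + half:start + block] = -scale   # right child
            columns.append(col)
    return np.stack(columns, axis=1)       # shape: (8 leaves, 7 features)

def sorted_eigh(gram):
    """Eigen-decompose a symmetric matrix, largest eigenvalue first."""
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

X = toy_tree_embeddings()
gram = X @ X.T                             # leaf-by-leaf Gram matrix
eigvals, eigvecs = sorted_eigh(gram)

# Sign pattern of the leading eigenvector: one sign per root branch.
top_split = np.sign(eigvecs[:, 0])

# Eigenvectors 2 and 3 take the same value on both leaves of each depth-2 pair.
pair_constant = np.allclose(eigvecs[0::2, 1:3], eigvecs[1::2, 1:3])
```

In this toy the eigenvalues are approximately 72, then 16 twice, then 2 four times, then 0, so the top eigenvector separates the two root branches and the next two only resolve the depth-2 sub-branches. Real embeddings are noisier, and the ordering depends on coarse distinctions actually carrying more variance, which is the assumption baked into `scales`.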

But the corpus immediately complicates the 'model-agnostic' part. Standard representation-analysis methods (PCA, linear regression, RSA, all spectral or linear in spirit) are systematically biased toward simple linear features and under-represent equally important nonlinear ones (see "Do standard analysis methods hide nonlinear features in neural networks?"). The same work uses homomorphic encryption to show that networks can compute perfectly while having no interpretable activation structure at all, which means a clean eigenvector ordering could exist in one model and be absent in another that solves the identical task. That non-uniqueness is the deep problem: two models with identical performance can carry fundamentally different, even fractured, internal organization that standard metrics do not detect (see "Can models be smart without organized internal structure?"). A probe that reads spectral order would report cleanly on the organized model and mislead you on the fractured one, yet both pass evaluation. So 'model-agnostic' in the sense of 'runs on any model' is cheap; 'model-agnostic' in the sense of 'returns trustworthy meaning on any model' is not guaranteed.
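
A small numerical sketch of the non-uniqueness point, under loud assumptions: the two "models" below are hypothetical activation matrices, not networks from the cited notes, and the second is just the first passed through an invertible linear map. That map leaves every linearly decodable feature intact, so a linear readout fits both equally well, yet it changes which direction the spectrum ranks first.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4

# Hypothetical "model A": axis 0 carries the target and has the most variance.
Z_a = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 1.0, 1.0])
y = Z_a[:, 0]

# Hypothetical "model B": same information, re-mixed by an invertible map that
# shrinks the target axis and inflates a nuisance axis.
A = np.diag([0.1, 5.0, 1.0, 1.0])
Z_b = Z_a @ A

def linear_readout_error(acts, target):
    """Mean squared error of the best least-squares linear readout."""
    w, *_ = np.linalg.lstsq(acts, target, rcond=None)
    return float(np.mean((acts @ w - target) ** 2))

def top_eigvec(acts):
    """Leading eigenvector of the sample covariance of the activations."""
    centered = acts - acts.mean(axis=0)
    cov = centered.T @ centered / (len(acts) - 1)
    _, vecs = np.linalg.eigh(cov)          # eigh sorts eigenvalues ascending
    return vecs[:, -1]

err_a = linear_readout_error(Z_a, y)
err_b = linear_readout_error(Z_b, y)

# How much of the leading eigenvector lies along the target axis (0 to 1).
alignment_a = abs(top_eigvec(Z_a)[0])
alignment_b = abs(top_eigvec(Z_b)[0])
```

Both readout errors are numerically zero, but `alignment_a` is close to 1 while `alignment_b` is close to 0. This only illustrates the linear half of the caution; the nonlinear-feature bias described above is a separate and stronger problem that this toy does not capture.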

There's also a question of what geometry the ordering even captures. The Polar Probe finds that syntax lives not in distance alone but also in angular position: relation type and direction are encoded together in a polar geometry, and reading both nearly doubles accuracy over distance-only methods (see "How do language models encode syntactic relations geometrically?"). Eigenvector ordering is fundamentally a distance/variance story, so it may be blind to exactly the directional structure that carries relational meaning. And sparsity work suggests the informative structure isn't always sitting in the top eigendirections: representational density is learned and data-dependent, dense for familiar inputs and sparse for unfamiliar ones (see "Is representational sparsity learned or intrinsic to neural networks?"), while last-layer activation sparsity is itself an orderable difficulty signal (see "Can representation sparsity order few-shot demonstrations effectively?"). The spectrum you'd probe shifts with the data you feed it.
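
One way to see why a variance-based spectrum can miss direction, sketched under stated assumptions: this is not the Polar Probe, and the "relation vectors" are random stand-ins for head-minus-dependent differences. The uncentered second-moment matrix is built from outer products, and the outer product of a vector with itself is unchanged when the vector is negated, so reversing any subset of relations leaves the spectrum untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical relation vectors, e.g. head embedding minus dependent embedding.
diffs = rng.normal(size=(200, 6))

# Reverse the direction of a random subset of relations.
flips = rng.choice([-1.0, 1.0], size=(200, 1))
reversed_diffs = diffs * flips

def second_moment_spectrum(vectors):
    """Eigenvalues (descending) of the uncentered second-moment matrix."""
    moment = vectors.T @ vectors / len(vectors)
    return np.linalg.eigvalsh(moment)[::-1]

# The spectrum cannot tell the two sets apart.
spectrum_unchanged = np.allclose(
    second_moment_spectrum(diffs), second_moment_spectrum(reversed_diffs)
)

# A signed projection onto a (hypothetical) probe direction does see the flips.
u = rng.normal(size=6)
proj = diffs @ u
proj_reversed = reversed_diffs @ u
direction_visible = np.allclose(proj_reversed, proj * flips[:, 0])
```

The invariance is exact only for the uncentered second moment; with mean-centering, flipping samples moves the mean and the spectrum can shift slightly. Either way, the ordering of eigenvectors carries no record of which way each relation pointed.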

The corpus's strongest meta-lesson is that no representational probe, spectral or otherwise, is self-sufficient. Mechanistic understanding requires pairing representational analysis with causal intervention: spectral ordering can locate a candidate feature, but only ablation or intervention confirms that it actually drives behavior (see "Can we understand LLM mechanisms with only representational analysis?"). Pruning and ablation studies repeatedly show networks decomposing into modular subnetworks where you can verify a structure is necessary and sufficient (see "Do neural networks naturally learn modular compositional structure?"), and weight-sparsity training can even build interpretability in by design (see "Can sparse weight training make neural networks interpretable by design?"). Both routes close the loop that spectral ordering leaves open.
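
The locate-then-verify pairing can be sketched as follows. Everything here is a hypothetical setup: `H` stands in for a layer's activations, `w` for a downstream linear readout, and "ablation" means projecting one eigen-direction out of the activations. The sketch ranks directions by variance first, then checks each one's causal effect on the readout.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 5

# Hypothetical activations with variance decreasing along the axes.
H = rng.normal(size=(n, d)) * np.array([4.0, 2.0, 1.0, 0.5, 0.25])

# Hypothetical downstream readout that only uses axis 2 (third by variance).
w = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

def eig_directions(acts):
    """Covariance eigenvalues and eigenvectors, largest variance first."""
    centered = acts - acts.mean(axis=0)
    vals, vecs = np.linalg.eigh(centered.T @ centered / (len(acts) - 1))
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def ablation_effect(acts, direction, readout):
    """Mean squared change in readout output after projecting out `direction`."""
    ablated = acts - np.outer(acts @ direction, direction)
    return float(np.mean((acts @ readout - ablated @ readout) ** 2))

vals, vecs = eig_directions(H)
effects = [ablation_effect(H, vecs[:, k], w) for k in range(d)]

most_variance = 0                        # rank 0 is the leading eigenvector
most_causal = int(np.argmax(effects))    # rank whose removal changes output most
```

In this toy the leading eigenvector is almost causally inert for the readout, while the third-ranked direction is the one whose removal changes the output. Spectral rank proposes candidates; the ablation column is what tells you which of them the behavior depends on.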

Where does that leave the inquiring reader? The most useful reframe in the corpus is Marr's three levels: treat spectral ordering as an implementation/algorithmic-level lens rather than a complete explanation, and layer it with behavioral and causal probes instead of asking it to stand alone (see "Can cognitive science methods unlock how LLMs actually work?"). The taxonomy result shows spectral ordering can be a real, transferable doorway into what a model has organized; the bias and fractured-representation results show it's a doorway that sometimes opens onto an empty room. As a model-agnostic *first pass* (cheap, label-free, comparable across architectures), it's promising. As a standalone verdict on what a model knows, the corpus says: don't trust ordering you haven't intervened on.

Sources: 10 notes

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Can cognitive science methods unlock how LLMs actually work?

Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.
