Do language models use the hierarchical geometry they inherit?

Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.

Note · 2026-05-28 · sourced from MechInterp

The decisive move in the co-occurrence account of concept geometry is a cross-architecture comparison. The hierarchical splitting geometry is first derived and confirmed for word2vec embeddings across many WordNet subtrees. Then the same coarse-to-fine spectral signature is shown to extend "strikingly well" to Gemma 2B unembeddings. Two systems with entirely different objectives and training regimes — a shallow predict-context embedding and a large autoregressive transformer's output matrix — carry the same hierarchical fingerprint. If the structure were a functional artifact of how an LLM reasons, it should not appear, in the same form, in a model that does not reason at all.

This is the strongest available argument that the geometry is statistical, not functional: a shared signature across architectures points to a shared cause upstream of both — the co-occurrence statistics of the training text — rather than convergent functional design. Each word is characterized by discrete, continuous, and hierarchical attributes; words with similar attributes co-occur more often; and that alone gives rise to the geometric organization. Both models inherit it because both are, in different ways, fitting the same pairwise statistics.
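The mechanism in the paragraph above can be illustrated with a toy sketch. It is not the paper's construction: the 8-word binary tree, the `affinity` function, and the weights `BASE`, `COARSE`, `MID`, and `SELF` are all hypothetical choices made for illustration. The only assumption carried over from the text is that words sharing more attributes co-occur more. Under that assumption, the vectors that split the vocabulary at each level of the tree turn out to be eigenvectors of the affinity matrix, with eigenvalues ordered from coarse to fine.

```python
# Toy model (an illustrative assumption, not the paper's construction):
# 8 words sit at the leaves of a depth-3 binary tree, and the affinity
# between two words grows with the depth of the ancestor they share.

N = 8
BASE, COARSE, MID, SELF = 1.0, 4.0, 2.0, 1.0  # hypothetical weights


def affinity(i, j):
    """Affinity of words i and j, built from the tree levels they share."""
    value = BASE                      # every pair shares the root
    if i // 4 == j // 4:
        value += COARSE               # same top-level branch
    if i // 2 == j // 2:
        value += MID                  # same second-level branch
    if i == j:
        value += SELF                 # the word itself
    return value


M = [[affinity(i, j) for j in range(N)] for i in range(N)]


def eigenvalue_of(v):
    """Return lambda if M v = lambda v (within tolerance), else None."""
    Mv = [sum(M[i][j] * v[j] for j in range(N)) for i in range(N)]
    lam = sum(a * b for a, b in zip(Mv, v)) / sum(b * b for b in v)
    is_eigvec = all(abs(a - lam * b) < 1e-9 for a, b in zip(Mv, v))
    return lam if is_eigvec else None


# Candidate directions: each one contrasts two sibling subtrees.
splits = {
    "coarse (branch A vs branch B)": [1, 1, 1, 1, -1, -1, -1, -1],
    "mid (within branch A)": [1, 1, -1, -1, 0, 0, 0, 0],
    "fine (within one leaf pair)": [1, -1, 0, 0, 0, 0, 0, 0],
}
spectrum = {name: eigenvalue_of(v) for name, v in splits.items()}

for name, lam in spectrum.items():
    print(f"{name}: eigenvalue {lam}")
```

With these weights the coarse split has eigenvalue 21.0, the mid split 5.0, and the fine split 1.0. Nothing in the sketch depends on a model architecture: the ordering follows from the pairwise statistics alone, which is the sense in which two very different models fitting the same statistics could inherit the same signature.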

Why it leaves a question open: the authors are explicit that such organization may be useful for function but is not driven by it, which leaves unresolved whether and where the LLM actually uses the hierarchical geometry it inherits. Shared structure is evidence of a common statistical origin; it is not evidence that the structure is inert in the transformer. Disentangling inherited-but-unused geometry from inherited-and-recruited geometry is the open problem this result sharpens rather than settles.

— "Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence", https://arxiv.org/abs/2605.23821

Related concepts in this collection

Where does hierarchical structure in language models come from? Do LLMs build hierarchical concept geometry through dedicated mechanisms, or does it emerge naturally from word co-occurrence patterns in training data? Understanding the source matters for interpreting what representations actually reveal about model computation.
the cross-architecture match is the evidence for the structure-without-mechanism claim
Do embedding eigenvectors organize taxonomy from coarse to fine? Can we predict how embeddings encode taxonomic hierarchies by examining their spectral structure? This tests whether word co-occurrence statistics alone produce the observed hierarchical geometry in language models.
the specific signature shown to be shared between word2vec and Gemma
Do standard analysis methods hide nonlinear features in neural networks? Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
sharpens the open question — detected structure need not be the structure the model computes with

Original note title

word2vec and gemma unembeddings share the same hierarchical signature so structure is statistical not functional