Can single-vector embeddings capture non-commutative relationships like word order?

This explores whether a single fixed-length vector can encode order-dependent meaning — the difference between 'dog bit man' and 'man bit dog' — or whether that's a geometric impossibility baked into how embeddings work.

The corpus has a sharp, mostly discouraging answer with an interesting caveat. The cleanest result is geometric: unit-sphere cosine spaces force concepts into linear superposition, and superposition is *commutative*: adding 'dog' + 'bit' + 'man' lands in the same place no matter the order (see: Why can't cosine space retrievers distinguish word order?). So a vector that lives on a sphere and is compared by cosine similarity is, by construction, blind to the very thing word order encodes. The finding is that this isn't a training failure you can fix with more data. It's structural, and escaping it requires architectural moves like token-level interaction (think late-interaction retrieval) or a downstream verification step rather than a single pooled vector.
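A minimal sketch of the commutativity argument, assuming the simplest possible pooling (sum the token vectors, then normalize onto the unit sphere). The vocabulary, the random vectors, and the `pool` helper are hypothetical stand-ins for a learned embedding model, not any particular retriever's implementation.

```python
import math
import random

random.seed(0)
DIM = 16

def random_unit_vector(dim):
    """Draw a random direction on the unit sphere."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical toy vocabulary: random unit vectors stand in for learned embeddings.
VOCAB = {word: random_unit_vector(DIM) for word in ("dog", "bit", "man")}

def pool(sentence):
    """Superpose (sum) the token vectors, then L2-normalize the result."""
    total = [0.0] * DIM
    for token in sentence.split():
        total = [t + x for t, x in zip(total, VOCAB[token])]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Addition does not care about order, so both sentences pool to the same point
# (up to floating-point rounding) and their cosine similarity is about 1.0.
sim = cosine(pool("dog bit man"), pool("man bit dog"))
print(sim)
```

Real encoders are contextual rather than purely additive, so this only illustrates the limiting case the geometric argument describes: whatever is encoded by superposition alone cannot separate the two sentences.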

That connects to a quieter problem with what embeddings actually measure. They capture *semantic association*, meaning what co-occurs, not roles or relevance (see: Do vector embeddings actually measure task relevance?). 'Dog,' 'bit,' and 'man' are all strongly associated regardless of who did the biting, so an association-based vector has no native handle on subject-vs-object. It is also consistent with how predictably LLMs degrade on syntax: top models misidentify embedded clauses and verb phrases, and the errors get *worse* as structural depth increases, which is exactly what you'd expect if the system learned surface co-occurrence rather than grammatical structure (see: Why do large language models fail at complex linguistic tasks?).

The caveat, and the part you didn't know you wanted, is that the contextualized activations *inside* a transformer do better than the static pooled embedding at the door. The Polar Probe shows models encode syntactic relations not just by distance but by *angle*: a polar-coordinate geometry where direction carries the type and orientation of a relation, nearly doubling accuracy over distance-only methods (see: How do language models encode syntactic relations geometrically?). Direction is the trick that lets geometry become non-commutative: A→B is not B→A. So the limitation is less 'neural nets can't represent order' and more 'a single cosine-compared vector throws the directional information away when it collapses a sequence into one point.'
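To make the distance-versus-direction point concrete, here is a toy illustration, not the Polar Probe itself. The three-dimensional vectors and the `relation_axis` direction are made-up values chosen only to show that distance is symmetric while a displacement flips sign when the pair is reversed.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Hypothetical contextual vectors for two words in one sentence.
dog = [1.0, 0.0, 2.0]
man = [0.0, 3.0, 1.0]

# Distance is symmetric: it cannot say which word is the head.
distance_ab = norm(sub(man, dog))
distance_ba = norm(sub(dog, man))

# Hypothetical direction that a probe might associate with one relation type.
relation_axis = [-1.0, 2.0, 0.0]

# The displacement dog -> man aligns with the axis; man -> dog points the other way.
angle_ab = cosine(sub(man, dog), relation_axis)
angle_ba = cosine(sub(dog, man), relation_axis)

print(distance_ab == distance_ba)  # the two distances are identical
print(angle_ab, angle_ba)          # equal magnitude, opposite sign
```

A distance-only readout sees the same number for both orientations; a readout that also uses the angle can tell them apart, which is the property the paragraph above attributes to direction.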

There's a representational-substrate angle too. Even static embeddings, before attention runs, are richer than 'just a lookup': they carry valence, concreteness, and other lexical content (see: Do transformer static embeddings actually encode semantic meaning?). Networks can also spontaneously carve compositional tasks into modular subnetworks that handle pieces independently (see: Do neural networks naturally learn modular compositional structure?). That suggests the machinery to represent structured, order-sensitive composition exists; it's the final pooling-into-one-vector-and-comparing-by-cosine step that destroys it.

The practical upshot for anyone building retrieval or search: if your queries hinge on order, negation, or who-did-what-to-whom, a single dense embedding will quietly conflate opposites, and no amount of fine-tuning fixes the geometry. The corpus points you toward multi-vector / token-level interaction or a verification pass on top — the same conclusion the cosine-space note reaches from first principles.
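One way to picture the 'verification pass on top' option is a two-stage pipeline. The sketch below is an assumption-laden toy: `bag_overlap` stands in for an order-blind dense retriever, and `order_score` (built on Python's `difflib.SequenceMatcher`) stands in for whatever order-sensitive verifier a real system would use, such as a cross-encoder or token-level scorer. All function names here are hypothetical.

```python
from difflib import SequenceMatcher

def bag_overlap(query, doc):
    """Order-blind first stage: Jaccard overlap of the token sets."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def order_score(query, doc):
    """Order-sensitive second stage: similarity of the token sequences."""
    return SequenceMatcher(None, query.split(), doc.split()).ratio()

def retrieve_then_verify(query, docs, k=2):
    """Shortlist with the order-blind score, then rerank with the verifier."""
    shortlist = sorted(docs, key=lambda d: bag_overlap(query, d), reverse=True)[:k]
    return sorted(shortlist, key=lambda d: order_score(query, d), reverse=True)

docs = ["man bit dog", "dog bit man", "cat sat down"]
query = "dog bit man"

# The first stage cannot separate the two 'bit' sentences: both overlap fully.
print(bag_overlap(query, "man bit dog"), bag_overlap(query, "dog bit man"))

# The verification pass breaks the tie in favour of the matching word order.
ranked = retrieve_then_verify(query, docs)
print(ranked)
```

The design point is only that the order-sensitive check runs on a short candidate list, so it can afford to look at token sequences that the first-stage vector has already collapsed.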

Sources 6 notes

Why can't cosine space retrievers distinguish word order?

Unit-sphere cosine spaces force concepts into linear superposition, a commutative structure that cannot robustly represent non-commutative distinctions like "dog bit man" versus "man bit dog." This geometric constraint persists regardless of training procedure and requires architectural alternatives like token-level interaction or downstream verification.

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
