Are static embeddings analogous to the formal linguistic competence layer?
This explores whether the static word-embedding layer of a transformer — what each token 'means' before attention mixes anything together — maps onto the 'formal linguistic competence' idea: a model's grasp of language structure as separate from using language to reason about the world.
This reads the question through the split that linguists draw between *formal* competence (knowing the rules and meanings of a language) and *functional* competence (using language to think and act in the world), and asks where static embeddings sit. The corpus suggests static embeddings encode real linguistic knowledge, but they're closer to the *lexicon* than to the formal-grammar layer the question names.
The strongest evidence that embeddings carry genuine knowledge comes from clustering analysis of RoBERTa's static vectors, which turn out to be sensitive to psycholinguistic properties like valence, concreteness, and even taboo, meaning each word arrives at the model already loaded with semantic content before self-attention does anything (see: Do transformer static embeddings actually encode semantic meaning?). So embeddings aren't empty slots waiting to be filled by context; they function as standalone lexical entries. That's a competence layer in the sense that the knowledge is *in there*, but it's word-level meaning, not the structural machinery of grammar.
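To make "before self-attention does anything" concrete, here is a minimal sketch contrasting a static lookup with one attention-style mixing step. The three-dimensional vectors and the names `EMBEDDINGS`, `static_lookup`, and `attention_mix` are hypothetical illustrations, not RoBERTa's actual values or API, and the mixing step omits the learned projections a real transformer uses.

```python
import math

# Hypothetical toy vectors; real RoBERTa embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "the": [0.2, 0.2, 0.2],
    "river": [0.1, 0.9, 0.2],
    "money": [0.8, 0.0, 0.7],
    "bank": [0.9, 0.1, 0.3],
}


def static_lookup(tokens):
    """Return the context-independent vector for each token (a plain table lookup)."""
    return [EMBEDDINGS[t] for t in tokens]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_mix(vectors):
    """One simplified self-attention step: softmax over dot products, then a weighted average.

    Assumption: no learned query/key/value projections, only the bare mixing idea.
    """
    mixed = []
    for query in vectors:
        scores = [math.exp(dot(query, key)) for key in vectors]
        total = sum(scores)
        weights = [s / total for s in scores]
        mixed.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]
        )
    return mixed


sentence_1 = ["the", "river", "bank"]
sentence_2 = ["the", "money", "bank"]

# The static vector for "bank" is identical in both sentences.
static_1 = static_lookup(sentence_1)[2]
static_2 = static_lookup(sentence_2)[2]

# After mixing, the vector at the "bank" position depends on its neighbors.
mixed_1 = attention_mix(static_lookup(sentence_1))[2]
mixed_2 = attention_mix(static_lookup(sentence_2))[2]
```

The lookup result for "bank" is the same in both sentences, while the mixed vectors differ. That is the sense in which whatever the static vector encodes is present before context contributes anything.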
Where the analogy strains is structure. If static embeddings were the formal competence layer, you'd expect models to handle syntax cleanly. They don't: top models systematically misread embedded clauses, complex verb phrases, and nested nominals, and the errors get predictably worse as syntactic depth increases (see: Why do large language models fail at complex linguistic tasks?). That points to the real formal-competence work happening in *composition*, which is what attention does across layers, not in the embedding lookup. The embeddings supply the pieces; the failures show up when the pieces have to be assembled by grammatical rules the model only approximates statistically.
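The text does not say how syntactic depth was measured. One simple proxy, assumed here purely for illustration, is the maximum bracket nesting in a bracketed constituency parse. The function name `max_depth` and the two example parses are hypothetical.

```python
def max_depth(bracketed_parse):
    """Return the deepest level of parenthesis nesting in a bracketed parse string.

    Assumption: depth is approximated by bracket nesting; the cited work may
    define syntactic depth differently.
    """
    depth = 0
    deepest = 0
    for char in bracketed_parse:
        if char == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif char == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parse: too many closing brackets")
    if depth != 0:
        raise ValueError("unbalanced parse: missing closing brackets")
    return deepest


# Hypothetical hand-written parses, not model output.
simple = "(S (NP the dog) (VP barked))"
nested = "(S (NP the dog (SBAR that (S (NP the cat) (VP chased)))) (VP barked))"

simple_depth = max_depth(simple)
nested_depth = max_depth(nested)
```

Under this proxy the relative clause in the second sentence pushes the nesting deeper than in the first, which is the kind of structure where the text reports more errors.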
There's a deeper framing here worth surfacing. One line of work argues LLMs operationalize Saussure's *langue*: they learn meaning purely as relational structure compressed from text, with no external referents (see: Can language models learn meaning without engaging the world?). Under that view the whole model, embeddings included, is a formal-competence engine: it masters the internal relations of language without grounding. Static embeddings would then be the most relational layer of all, with meaning defined entirely by neighbors. And efforts to move reasoning *up* to sentence-level embeddings, as in Large Concept Models, are essentially a bet that the formal/relational layer can carry abstraction on its own, language-agnostically (see: Can reasoning happen at the sentence level instead of tokens?).
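"Meaning defined entirely by neighbors" can be sketched as a nearest-neighbor query under cosine similarity. The vectors in `TOY_VECTORS` are invented for illustration and stand in for learned embeddings; `cosine` and `nearest_neighbors` are hypothetical helper names.

```python
import math

# Invented toy vectors; a trained model would supply these.
TOY_VECTORS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.1],
    "car": [0.1, 0.2, 0.9],
    "truck": [0.2, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(word, k=1):
    """Return the k words whose vectors are most similar to the given word's vector."""
    target = TOY_VECTORS[word]
    others = [w for w in TOY_VECTORS if w != word]
    others.sort(key=lambda w: cosine(target, TOY_VECTORS[w]), reverse=True)
    return others[:k]


cat_neighbors = nearest_neighbors("cat", k=1)
car_neighbors = nearest_neighbors("car", k=1)
```

Nothing in this lookup refers to anything outside the vector table: a word is characterized only by which other words sit close to it, which is the relational reading described above.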
The thing you might not have expected to learn: the formal-competence analogy holds best where models look most fluent and breaks where they appear to reason about the world or other minds. Embeddings nail word meaning and relational structure (formal), but the same architecture defaults to surface heuristics the moment a task needs genuine world- or mind-modeling: theory-of-mind benchmarks show models faking perspective-taking rather than tracking beliefs (see: Do large language models genuinely simulate mental states?). So static embeddings aren't the whole formal layer. They're its lexical floor, and the gap they expose is the formal-vs-functional boundary itself.
Sources (5 notes)
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.