Can speech embeddings carry articulatory structure that text cannot?
This explores whether self-supervised speech embeddings capture the physical machinery of how sounds are produced — the vocal tract's articulatory process — in a way that text, as a symbolic abstraction, structurally cannot.
The corpus gives a fairly direct yes. Self-supervised speech models appear to learn the causal articulatory process behind acoustics, that is, the language-agnostic mechanics of how the vocal tract shapes air into sound, rather than memorizing language-specific phonetic categories (see: Do speech models learn language-specific sounds or universal physics?). The tell is generalization: this articulatory grounding predicts downstream performance and multilingual transfer better than phonetic probing does, which is what you'd expect if the model has latched onto the generative physics rather than surface labels.
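To make the probing idea concrete, here is a minimal sketch of a linear articulatory probe: fit a linear map from frame-level embeddings to a continuous articulator trajectory, then score it by correlation on held-out frames. Everything in it is an assumption for illustration. The data is synthetic stand-in data, not real speech embeddings or articulatory measurements, and `fit_linear_probe`, `predict`, and `pearson` are hypothetical helper names rather than the method used in the cited note.

```python
import random

def fit_linear_probe(X, y, ridge=1e-3):
    """Ridge least squares: solve (X^T X + ridge * I) w = X^T y."""
    d = len(X[0])
    # Normal equations, stored as an augmented matrix [A | b].
    A = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    M = [A[i] + [b[i]] for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(d):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][d] / M[i][i] for i in range(d)]

def predict(X, w):
    """Apply the linear probe to each embedding frame."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Synthetic stand-in data (assumption): 8-dim "embeddings" whose hidden
# linear combination plays the role of one articulator trajectory.
rng = random.Random(0)
dim = 8
true_w = [rng.gauss(0, 1) for _ in range(dim)]

def make_frames(n):
    X = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    y = [sum(x * w for x, w in zip(row, true_w)) + rng.gauss(0, 0.1)
         for row in X]
    return X, y

X_train, y_train = make_frames(200)
X_test, y_test = make_frames(100)

w = fit_linear_probe(X_train, y_train)
probe_score = pearson(predict(X_test, w), y_test)
print(f"held-out probe correlation: {probe_score:.3f}")
```

With real data, a score like this would be computed per model, and one could then ask how well it tracks downstream performance compared with a phonetic-classification probe.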
The reason text can't do this is the more interesting half. Text is a lossy human abstraction: it strips the physics, geometry, and causality present in the original signal, leaving language models to shuffle symbols without contact with the dynamics that produced them (see: Are text-only language models fundamentally limited by abstraction?). Articulation is exactly the kind of source dynamic that gets compressed away when speech becomes a written string. A related argument sharpens the point: meaning (and by extension, grounded structure) requires a relation between a form and what generated it (for Bender & Koller, the communicative intent behind an expression), not just form-to-form prediction, and text training has no access to that relation (see: Can language models learn meaning from text patterns alone?). Speech embeddings, trained on the acoustic signal itself, sit one step closer to the source.
The lateral surprise is that embeddings in general turn out to be richer than their training objective suggests. Static text embeddings already encode psycholinguistic dimensions you wouldn't expect from pure co-occurrence, including iconicity, the degree to which a word's sound resembles its meaning (see: Do transformer static embeddings actually encode semantic meaning?). Iconicity can be read as a faint echo of articulatory structure leaking back into text. And language models spontaneously build structured, near-symbolic geometry from raw text, encoding syntactic type and direction in polar coordinates within their activations, that is, in both the distance and the angle between embeddings (see: How do language models encode syntactic relations geometrically?). If text models manufacture that much latent structure from impoverished input, speech models fed the actual acoustics plausibly have far more articulatory structure available to organize.
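A toy sketch can show why angle adds information that distance alone cannot. It is not the Polar Probe itself: the 2-D coordinates and the `polar_relation` helper are hypothetical, chosen only so that three word pairs sit at the same distance while differing in angle.

```python
import math

def polar_relation(head, dependent):
    """Return (distance, angle in radians) of the vector head -> dependent."""
    dx = dependent[0] - head[0]
    dy = dependent[1] - head[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

# Hypothetical 2-D positions in some learned probe space (assumption).
verb = (0.0, 0.0)
subject = (1.0, 0.0)
obj = (0.0, 1.0)

dist_subj, angle_subj = polar_relation(verb, subject)
dist_obj, angle_obj = polar_relation(verb, obj)
dist_rev, angle_rev = polar_relation(subject, verb)

# A distance-only view sees three identical relations; the angle separates
# the two relation types and flips by pi when the direction is reversed.
print(dist_subj, dist_obj, dist_rev)
print(angle_subj, angle_obj, angle_rev)
```

In this toy layout, relation type and direction are both invisible to distance and fully recoverable from angle, which is the intuition behind reading structure off polar coordinates.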
The boundary isn't absolute, though. Language models can perform genuine phonological analysis through step-by-step reasoning, constructing valid generalizations about sound patterns rather than just parroting them (see: Can language models actually analyze language structure?). So text models can reason *about* articulation symbolically even if they can't *embed* it. The distinction worth carrying away: speech embeddings carry articulatory structure implicitly, as learned physics; text can only ever describe it explicitly, as inferred rules. Two different ways of knowing the same vocal tract: one from the inside of the signal, one from the outside.
Sources (6 notes)
Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.