Do speech encoders actually learn the physics of how vocal tracts produce sound?
This explores whether self-supervised speech models learn the underlying mechanics of speech production — how the vocal tract physically shapes sound — rather than memorizing the specific sounds of particular languages.
The corpus has a surprisingly direct answer. Self-supervised speech models appear to infer the causal articulatory processes that generate acoustics: the language-agnostic physics of how a vocal tract moves air, not the phonetic inventory of any one language (see "Do speech models learn language-specific sounds or universal physics?"). The tell is multilingual transfer: a model trained on one language carries over to others, which is hard to explain if it learned 'the sounds of English' but easy if it learned 'how human mouths make any sound.' Articulatory structure also predicts downstream task performance better than probing for phonetic categories does, which suggests the production physics is the more fundamental thing the model is tracking.
What makes this striking is the contrast with text. A parallel line in the corpus argues that text-only language models are stuck in Plato's cave: text is a lossy abstraction that strips out the physics, geometry, and causality of the world it describes, leaving models to shuffle symbols with no grounding in the dynamics that produced them (see "Are text-only language models fundamentally limited by abstraction?"). Speech sits closer to the source. The acoustic signal is a physical trace of a physical process, so a model learning to predict it has a path back to the generating mechanism that a text model does not have: the articulatory cause is recoverable from the data in a way that a word's referent is not.
There's an important caveat the corpus raises elsewhere, though: learning to represent something is not the same as using it. Studies repeatedly show that models can encode information in their internal representations while that information fails to causally influence their outputs (see "Do language models actually use their encoded knowledge?"). So 'the encoder represents articulatory structure' and 'the encoder's behavior is driven by articulatory structure' are different claims. The speech-SSL finding is stronger precisely because it points to *causal* articulatory processes and ties them to behavior, but the broader encoding-vs-usage gap is the right skepticism to hold when someone says a model 'learned the physics.'
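The encoding-vs-usage gap can be made concrete with a toy sketch. This is not the method of any study in the corpus: the hidden states, the binary property, and the readout below are all synthetic and hypothetical, chosen only to show that a linear probe can decode a property perfectly well while erasing that property leaves the output untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": 500 examples, 8 dimensions (all hypothetical).
n, d = 500, 8
feature = rng.integers(0, 2, size=n).astype(float)  # toy binary property
hidden = rng.normal(size=(n, d))
# Dimension 0 encodes the property as roughly -1 or +1, plus small noise.
hidden[:, 0] = 2.0 * feature - 1.0 + 0.1 * rng.normal(size=n)

# Toy readout that ignores dimension 0 entirely.
w_out = rng.normal(size=d)
w_out[0] = 0.0


def readout(h):
    """Linear output head applied to the hidden states."""
    return h @ w_out


# 1) Probing: fit a least-squares linear probe to decode the property.
X = np.hstack([hidden, np.ones((n, 1))])  # add a bias column
coef, *_ = np.linalg.lstsq(X, feature, rcond=None)
probe_acc = float(np.mean((X @ coef > 0.5) == (feature > 0.5)))

# 2) Intervention: erase dimension 0 and measure how much the output moves.
ablated = hidden.copy()
ablated[:, 0] = 0.0
max_output_shift = float(np.max(np.abs(readout(hidden) - readout(ablated))))

print(f"probe accuracy:   {probe_acc:.3f}")
print(f"max output shift: {max_output_shift:.3e}")
```

In this construction the probe succeeds because the property is linearly present in the representation, while the ablation changes nothing because the readout never touches that dimension. A probe result alone therefore cannot distinguish "represented" from "used"; some form of intervention is needed for the second claim.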
It's also worth seeing what 'learning physics' is being contrasted against. Another thread argues that language models can master meaning as pure relational structure (Saussure's langue), compressing how symbols relate to each other without ever touching the world (see "Can language models learn meaning without engaging the world?"). Speech encoders seem to be doing the opposite: instead of a closed web of relations, they recover an external generative mechanism. That's the deeper payoff of the question. The same architecture, fed text, becomes a relational symbol-manipulator; fed raw audio, it reaches for the causal machinery behind the signal. The modality of the data, not the model, decides whether 'physics' is even on the table.
Sources (4 notes)
Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
Multiple studies confirm that language models can encode facts in their representations while those facts fail to causally affect downstream outputs. Encoding and usage are distinct processes.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.