Do speech models learn language-specific sounds or universal physics?
Exploring whether self-supervised speech models encode phonetic categories tied to specific languages or instead capture the underlying vocal-tract physics common to all humans. This matters for understanding why these models transfer across languages without retraining.
Self-supervised speech models like wav2vec and HuBERT learn from raw audio without phonetic labels, yet their internal representations correlate strongly with articulatory kinematics — the actual movements of the tongue, lips, and vocal folds that produce speech. The hypothesis tested in this work is stronger than correlation: the models infer the causal articulatory processes that generate the acoustic signal. If true, the inference should be language-agnostic, because the human vocal tract is anatomically common across all populations and the acoustics are determined by vocal-tract resonance regardless of which language is being spoken.
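This correlation is typically measured with a linear probe from frozen SSL features to electromagnetic articulography (EMA) traces. The sketch below is a minimal illustration of that setup, assuming paired audio and EMA data already resampled to the encoder's frame rate; the model name, layer choice, and ridge probe here are illustrative assumptions, not the exact recipe of the work discussed in this note.

```python
# Minimal articulatory-probing sketch, assuming paired audio / EMA data
# (electromagnetic articulography traces for tongue, lip, and jaw sensors)
# time-aligned to the encoder's ~50 Hz frame rate. Model name, layer, and
# probe type are illustrative, not the exact recipe from the cited work.
import numpy as np
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
from sklearn.linear_model import Ridge

model_name = "facebook/wav2vec2-base"   # any SSL speech encoder would do
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

def hidden_states(waveform_16k: np.ndarray, layer: int = 9) -> np.ndarray:
    """Return frame-level hidden states (T, D) from one transformer layer."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0).numpy()

def articulatory_probe(feats: np.ndarray, ema: np.ndarray) -> float:
    """Fit a linear map from SSL features (T, D) to EMA channels (T, C) and
    report the mean per-channel Pearson correlation. In practice the probe
    would be fit and scored on separate splits."""
    probe = Ridge(alpha=1.0).fit(feats, ema)
    pred = probe.predict(feats)
    corrs = [np.corrcoef(pred[:, i], ema[:, i])[0, 1] for i in range(ema.shape[1])]
    return float(np.mean(corrs))
```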
This matters because it gives a principled reason for the empirical observation that SSL speech models transfer across languages without retraining. They are not learning language-specific phonetic categories that happen to overlap; they are learning the physics that underlies all human speech production. The phonetic categories of any specific language are projections of that underlying articulatory space, so a model that captures the space natively can be projected onto any language with comparatively little adaptation.
The implication for speech model design is that articulatory inversion should not be treated as just another downstream task — it is a window into what the model has already learned. Cho et al. showed that the SSL-articulatory correlation predicts downstream task success, which makes articulatory probing a more informative quality measure than phonetic probing for these models. The articulatory frame also explains why SSL speech models outperform supervised models on low-resource languages: the supervised model lacks the articulatory substrate, while the SSL model carries it implicitly. This substrate is what direct speech-to-speech systems (see "Can skipping transcription make voice assistants faster?") preserve when they bypass transcription — and what "What speech tasks remain without standardized benchmarks?" argues current benchmarks miss.
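To make the comparison concrete, a phonetic probe over the same frozen features can be scored alongside the articulatory probe sketched above. The snippet below is a hedged sketch, assuming frame-level phone labels from a forced aligner; under the note's claim, the articulatory score should track downstream quality more closely than this phone-classification score does.

```python
# Companion sketch: a frame-level phonetic probe over the same frozen SSL
# features, for comparison with the articulatory probe above. Phone labels
# are assumed to come from a forced aligner (an assumption, not part of the
# original note).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def phonetic_probe(feats: np.ndarray, phone_ids: np.ndarray) -> float:
    """Mean cross-validated phone-classification accuracy of a linear probe."""
    clf = LogisticRegression(max_iter=1000)
    return float(cross_val_score(clf, feats, phone_ids, cv=3).mean())

# If the articulatory-substrate account is right, articulatory_probe() from
# the earlier sketch should predict downstream task performance better than
# this phone accuracy does, across layers and across languages.
```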
Source: Speech Voice
Related concepts in this collection
- Can skipping transcription make voice assistants faster?
  Voice assistants traditionally convert speech to text before responding. Does eliminating that middle step reduce latency enough to matter for real-time conversation?
  supports: gives a principled reason direct speech-to-speech outperforms ASR cascades — the encoder carries the articulatory substrate that transcription discards
- Can speech features be separated into semantic and stylistic components?
  Linguistic theory suggests gestures decompose into semantic units and motion variations. Does this decomposition actually emerge in speech encoder layers, and can it enable more expressive gesture synthesis?
  extends: same finding that speech encoders capture deep production-level structure, not just phonetic categories — gesture generation discovers the same disentanglement at a different layer
- What speech tasks remain without standardized benchmarks?
  Speech evaluation has strong benchmarks for transcription and translation, but broader comprehension and reasoning tasks over audio lack standardized measurement. This gap may constrain which capabilities researchers prioritize building.
  extends: phonetic-probing benchmarks are part of the same evaluation gap — articulatory probing is a richer measure that current benchmarks do not reward
Original note title: speech SSL models infer the causal articulatory processes that generate acoustics — language-agnostic vocal-tract physics underlies multilingual transfer