Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
To understand such utility, the internal representations of speech SSL models have been scrutinized by probing analyses for known speech and linguistic features, such as low-level acoustics, phonetics, and lexical semantics [2, 3, 4, 5]. A comparative analysis by Cho et al. [5] demonstrates that the representations of state-of-the-art SSL models are highly correlated with articulatory kinematics, and that this correlation predicts the success of the SSL model in downstream tasks. This finding was later extended to develop a high-performance acoustic-to-articulatory inversion (AAI) model [6].
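The probing methodology referenced above can be illustrated with a minimal sketch: fit a linear probe from frame-level model features to articulator trajectories and score the fit by per-channel correlation. The sketch below uses synthetic data and a closed-form ridge regression; the array shapes, regularizer, and scoring function are illustrative assumptions, not the exact pipeline of [5] or [6].

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (a real probe would use SSL features and EMA traces):
# X: frame-level SSL features (T frames x D dims)
# Y: articulator trajectories, e.g. EMA sensor coordinates (T x C channels)
T, D, C = 500, 32, 6
W_true = rng.normal(size=(D, C))
X = rng.normal(size=(T, D))
Y = X @ W_true + 0.1 * rng.normal(size=(T, C))

# Closed-form ridge regression probe: W = (X^T X + lam*I)^-1 X^T Y
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
Y_hat = X @ W

def pearson(a, b):
    # Column-wise Pearson correlation between two (T x C) arrays.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / (
        np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
    )

# Mean per-channel correlation serves as the probing score.
score = pearson(Y_hat, Y).mean()
print(f"mean correlation: {score:.3f}")
```

Under this kind of analysis, a higher mean correlation indicates that articulatory kinematics are more linearly decodable from the model's representations.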
Here, we test an intriguing hypothesis: speech SSL models infer the causal articulatory processes that generate the speech acoustic signal. If this hypothesis is true, such inference should be agnostic to language or dialect, since speech acoustics arise from the resonances of vocal tract shapes, which add spectral detail to the vocal source, i.e., the vibration of the vocal folds. Moreover, humans share a common canonical vocal tract anatomy and orofacial musculature, regardless of ethnicity, language, or dialect.