Can articulatory inversion serve as a window into what speech models have learned?

This note reads articulatory inversion (reconstructing the vocal-tract movements behind a sound) as an interpretability probe: a way to ask what a speech model actually represents under the hood. The corpus suggests it works because the model has already learned the articulation, not just the acoustics.

This note asks whether articulatory inversion (recovering how the mouth and vocal tract moved to produce a sound) can act as a window into what speech models have internally learned, rather than serving only as a speech-engineering trick. The most direct evidence in the collection says yes, and for a surprising reason: self-supervised speech models appear to infer the causal articulatory processes that generate acoustics, not language-specific phonetic categories (see "Do speech models learn language-specific sounds or universal physics?"). If a model has implicitly reconstructed the physics of the vocal tract, then probing it for articulation is not testing whether it can do a task; it is reading out a representation the model built on its own. Notably, that work found that articulatory probing predicts downstream performance *better* than phonetic probing, which is the real claim hiding in your question: the better window is the one that matches what the model actually encoded, not the labels humans find intuitive.
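To make "probing for articulation" concrete, here is a minimal sketch of a linear probe. It is not the method of the cited work: the hidden states and articulatory trajectories are synthetic, and the names `fit_ridge_probe`, `probe_score`, and `make_synthetic` are hypothetical helpers introduced for illustration. In a real study the hidden states would come from a speech model and the targets from measured articulator positions.

```python
import numpy as np

def fit_ridge_probe(H, A, alpha=1.0):
    """Fit a ridge-regularized linear map from hidden states H
    (frames x hidden dims) to articulatory targets A (frames x channels)."""
    H_mean, A_mean = H.mean(axis=0), A.mean(axis=0)
    Hc, Ac = H - H_mean, A - A_mean
    d = Hc.shape[1]
    # Closed-form ridge solution: (H^T H + alpha I)^-1 H^T A
    W = np.linalg.solve(Hc.T @ Hc + alpha * np.eye(d), Hc.T @ Ac)
    return W, H_mean, A_mean

def probe_score(H, A, W, H_mean, A_mean):
    """Mean Pearson correlation across articulatory channels."""
    pred = (H - H_mean) @ W + A_mean
    corrs = [np.corrcoef(pred[:, k], A[:, k])[0, 1] for k in range(A.shape[1])]
    return float(np.mean(corrs))

def make_synthetic(n_frames=2000, n_art=6, n_hidden=32, noise=0.1, seed=0):
    """Assumption for illustration: hidden states are a noisy linear
    mixture of latent articulatory trajectories."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_frames, n_art))
    M = rng.standard_normal((n_art, n_hidden))
    H = A @ M + noise * rng.standard_normal((n_frames, n_hidden))
    return H, A

H, A = make_synthetic()
split = 1500  # first 1500 frames train the probe, the rest are held out
W, h_mean, a_mean = fit_ridge_probe(H[:split], A[:split])
score = probe_score(H[split:], A[split:], W, h_mean, a_mean)
print(f"held-out articulatory probe correlation: {score:.3f}")
```

The probe is kept linear on purpose: a high held-out score from a simple readout is evidence that the information is already present in the representation, whereas a powerful nonlinear probe could learn the inversion by itself and say less about the model.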

That reframes inversion as a member of a larger family of interpretability moves the corpus keeps returning to: the search for the unit of analysis that reveals a model's internal commitments. In language models, sparse autoencoders surfaced an entity-recognition mechanism that the model uses to track its own knowledge and that steers hallucination versus refusal (see "Do models know what they don't know?"). The parallel is tight: in both cases the informative probe targets a causal variable (one that steers behavior or generates the signal) rather than a merely correlated one. Articulatory inversion is the speech-domain version of finding that causal latent variable.

But a window only shows you what's behind it, and the collection cautions against assuming there is a stable object back there at all. Transformers may transmit knowledge as continuous flow rather than fixed storage, closer to oral performance than to a retrievable archive (see "Do transformer models store knowledge or generate it continuously?"), which means an inverted articulatory trajectory might be a snapshot of a process, not a lookup of a stored phoneme. Relatedly, hidden states reorganize under pressure: representations sparsify adaptively when a task drifts out of distribution (see "Do language models sparsify their activations under difficult tasks?"). That finding concerns language models, so carrying it over to speech is an extrapolation, but it suggests that what inversion reveals could shift depending on whether the input is familiar speech or an accent the model has never heard. The window's view changes with the weather.

There's also a cross-domain warning worth carrying over. Probing assumes the model committed to one thing you can read out, but language models can hold a superposition and only sample a particular character at generation time (see "Do large language models actually commit to a single character?"). Ported to speech, this suggests a single acoustic frame might be consistent with several articulatory configurations the model is implicitly holding open, so inversion may recover a distribution over vocal-tract states rather than the one true gesture. That's not a failure of the window; it's the window honestly showing you that the model's knowledge is probabilistic.
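A toy example shows why a distribution can be the more honest readout when several articulatory configurations fit the same sound. The forward model below (acoustic value equals the square of a single articulatory value) is an assumption chosen only because it is one-to-many under inversion; it is not a model of a real vocal tract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward model (assumption): articulatory values +a and -a
# both produce the same acoustic value a**2.
articulation = rng.uniform(-1.0, 1.0, size=5000)
acoustic = articulation ** 2

# Collect every articulation whose acoustic value is close to 0.49,
# i.e. configurations near +0.7 and near -0.7.
target = 0.49
near = np.abs(acoustic - target) < 0.02
candidates = articulation[near]

# A least-squares point estimate converges to the conditional mean,
# which averages the two modes and lands near zero.
point_estimate = candidates.mean()

# Treating the answer as a distribution keeps the two modes separate.
pos_mode = candidates[candidates > 0].mean()
neg_mode = candidates[candidates < 0].mean()

print(f"point estimate: {point_estimate:+.3f} -> acoustic {point_estimate ** 2:.3f}")
print(f"positive mode:  {pos_mode:+.3f} -> acoustic {pos_mode ** 2:.3f}")
print(f"negative mode:  {neg_mode:+.3f} -> acoustic {neg_mode ** 2:.3f}")
```

Pushing the point estimate back through the forward model gives an acoustic value far from the target, while either mode reproduces it. In this toy setting the averaged gesture is one that never produced the sound, which is the sense in which a distributional readout is more faithful than a single trajectory.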

The thing you may not have known you wanted to know: articulatory inversion is interesting *not* because it lets us decode speech better, but because it's an accidental confession. A model trained only on raw audio, never told about lips or tongues, appears to reconstruct the bodily mechanics anyway, which suggests that the most learnable structure in speech is the physical act that made it. The window doesn't just show what the model learned; it shows what was there to be learned in the first place.

Sources (5 notes)

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification appears to act as a selective filter that stabilizes performance under out-of-distribution (OOD) shift, rather than being a failure mode.

Do large language models actually commit to a single character?

Shanahan's 20-questions test suggests that LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that the model has not committed to a single answer in advance.
