Can feature disentanglement in gesture synthesis generalize to completely unseen voice distributions?

This explores whether splitting speech into separate controllable components for gesture generation actually holds up when the model hears voices it never trained on — and why disentanglement might be the thing that makes that generalization possible.

The corpus has a direct answer to this question and, more interestingly, an explanation for why it works. DeepGesture splits speech into high-level semantic features and low-level motion features across different encoder layers, and that separation is what lets it generalize to out-of-distribution synthetic voices (see: Can speech features be separated into semantic and stylistic components?). The intuition: if you've cleanly separated *what is being said* from *how the body should move with the prosody*, then a never-before-heard voice changes the surface acoustics but not the underlying semantic-and-motion structure the model is actually keying on. Disentanglement turns 'unseen voice' from a distribution-shift problem into a recombination of factors the model already understands.

The deeper reason this generalizes shows up in a paper that isn't about gesture at all. Self-supervised speech models tend to learn the language-agnostic *physics* of how a vocal tract produces sound, rather than memorizing language- or speaker-specific phonetic categories (see: Do speech models learn language-specific sounds or universal physics?). That's the same generalization mechanism one layer down: when a model captures the causal process generating the acoustics instead of the acoustic surface itself, a new speaker is just the same process with different parameters. Gesture disentanglement and articulatory inference are two instances of the same bet: recover the generative factors, and unseen distributions stop being scary.

But the corpus also marks the boundary of that bet. Text-only models inherit the abstraction limits of language itself, stripping away the physics and dynamics present in the real signal and producing predictable failures wherever grounding matters (see: Are text-only language models fundamentally limited by abstraction?). Read against gesture synthesis, this is a caution: disentanglement generalizes only over the factors the representation actually encodes. A voice that varies along a dimension the model never separated out (an emotional register, an accent, a speaking style outside the learned space) is a genuinely unseen *factor*, not just an unseen sample, and there the clean recombination story breaks.
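The difference between an unseen sample and an unseen factor can be made concrete as an evaluation split. The following is a minimal sketch, not anything taken from DeepGesture: the factor names (`emotion`, `accent`), the `Voice` record, and the `split_eval` helper are all hypothetical, and it treats "a factor value absent from training" as a rough proxy for "outside the learned space".

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record: a voice described by labeled factors plus a speaker id.
Voice = Dict[str, str]


def seen_values(train: Iterable[Voice], factors: List[str]) -> Dict[str, Set[str]]:
    """Collect every value each factor takes in the training set."""
    seen: Dict[str, Set[str]] = {f: set() for f in factors}
    for voice in train:
        for f in factors:
            seen[f].add(voice[f])
    return seen


def split_eval(
    train: List[Voice], eval_voices: List[Voice], factors: List[str]
) -> Tuple[List[Voice], List[Tuple[Voice, List[str]]]]:
    """Partition evaluation voices into two buckets.

    unseen_sample: every factor value was seen in training (possibly in a
                   new combination, possibly from a new speaker).
    unseen_factor: at least one factor value never appeared in training;
                   the novel factor names are returned alongside the voice.
    """
    seen = seen_values(train, factors)
    unseen_sample: List[Voice] = []
    unseen_factor: List[Tuple[Voice, List[str]]] = []
    for voice in eval_voices:
        novel = [f for f in factors if voice[f] not in seen[f]]
        if novel:
            unseen_factor.append((voice, novel))
        else:
            unseen_sample.append(voice)
    return unseen_sample, unseen_factor


if __name__ == "__main__":
    train = [
        {"speaker": "s1", "emotion": "neutral", "accent": "us"},
        {"speaker": "s2", "emotion": "happy", "accent": "uk"},
    ]
    eval_voices = [
        # New speaker, new combination, but both values were seen in training.
        {"speaker": "s3", "emotion": "happy", "accent": "us"},
        # The emotion value never appeared in training.
        {"speaker": "s4", "emotion": "angry", "accent": "us"},
    ]
    samples, novel_factor = split_eval(train, eval_voices, ["emotion", "accent"])
    print("unseen sample:", [v["speaker"] for v in samples])
    print("unseen factor:", [(v["speaker"], n) for v, n in novel_factor])
```

Reporting gesture quality separately on the two buckets is one way to see whether a generalization claim covers only recombination of known factors or also holds when a factor itself is new.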

There's also a failure mode worth knowing about from the opposite direction. Models don't always preserve the diversity they were trained on: in controlled experiments, RL post-training amplifies one format distribution from pretraining and collapses the alternatives within the first epoch (see: Does RL training collapse format diversity in pretrained models?). The lesson for any synthesis pipeline chasing generalization is that *how you train* can quietly destroy the very factor-diversity that disentanglement is supposed to exploit: you can disentangle features and then optimize your way back into a narrow mode. Generalization to unseen voices isn't only an architecture property; it's something training can give and take away.
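One simple way to watch for this kind of collapse is to compare the diversity of output modes before and after a training stage. The sketch below uses Shannon entropy over mode labels; the source does not name a metric, so the entropy measure, the `mode_entropy` and `has_collapsed` helpers, the 0.5 threshold, and the mode labels themselves (for example, gesture-style clusters) are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mode_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over mode labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def has_collapsed(
    before: Sequence[Hashable], after: Sequence[Hashable], min_ratio: float = 0.5
) -> bool:
    """Flag a collapse when entropy after training drops below
    min_ratio times the entropy before training (threshold is arbitrary)."""
    h_before = mode_entropy(before)
    if h_before == 0.0:
        # There was no diversity to lose in the first place.
        return False
    return mode_entropy(after) / h_before < min_ratio


if __name__ == "__main__":
    # Hypothetical mode labels assigned to 100 generated outputs.
    before = ["a", "b", "c", "d"] * 25           # four modes, evenly used
    after = ["a"] * 97 + ["b", "c", "d"]         # one mode dominates
    print(mode_entropy(before), mode_entropy(after))
    print(has_collapsed(before, after))
```

Tracking this per epoch, rather than only at the end, matters if the collapse happens as early as the first epoch.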

So the short version: yes, disentanglement in gesture synthesis does generalize to out-of-distribution voices, at least the synthetic ones DeepGesture was evaluated on, and the reason is the same reason speech SSL models transfer across languages: they recover generative processes rather than surface samples. The thing you didn't know you wanted to know is that 'unseen voice' has two very different meanings. An unseen *sample* of a known factor generalizes cleanly; an unseen *factor* the representation never separated does not, and the gap between those two is where every disentanglement claim should be stress-tested.

Sources (4 notes)

Can speech features be separated into semantic and stylistic components?

DeepGesture's diffusion model splits speech into high-level semantic features and low-level motion features across encoder layers, enabling emotion-guided control. This disentanglement produces gestures that are both contextually appropriate and emotionally expressive, and generalizes to out-of-distribution synthetic voices.

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
