Can speech features be separated into semantic and stylistic components?
Linguistic theory suggests gestures decompose into semantic units and motion variations. Does this decomposition actually emerge in speech encoder layers, and can it enable more expressive gesture synthesis?
The current bottleneck in creating digital humans is generating character movements that correspond naturally to text or speech input. Gesture synthesis is typically approached as classification, clustering, or regression. DeepGesture casts it as regression-based prediction conditioned on multimodal signals (text, speech, emotion, and seed motion) and uses a diffusion architecture, built on DiffuseStyleGesture, with two architectural choices that improve semantic alignment and emotional expressiveness.
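To make the conditioning concrete, here is a minimal PyTorch sketch of a denoiser that takes a noisy gesture sequence plus the four conditioning signals. The module names, dimensions, and concatenation-based fusion are illustrative assumptions, not DeepGesture's published architecture.

```python
import torch
import torch.nn as nn

class GestureDenoiser(nn.Module):
    """Illustrative denoiser: predicts clean gesture frames from a noisy sequence
    plus multimodal conditioning (text, speech, emotion, seed motion).
    Dimensions and fusion-by-concatenation are assumptions, not DeepGesture's spec."""

    def __init__(self, gesture_dim=256, text_dim=300, speech_dim=1024,
                 emotion_dim=64, seed_frames=8, hidden=512):
        super().__init__()
        cond_dim = text_dim + speech_dim + emotion_dim + seed_frames * gesture_dim
        self.in_proj = nn.Linear(gesture_dim, hidden)
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(),
                                        nn.Linear(hidden, hidden))
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4)
        self.out_proj = nn.Linear(hidden, gesture_dim)

    def forward(self, noisy_gesture, t, text_feat, speech_feat, emotion_embed, seed_motion):
        # noisy_gesture: (B, T, gesture_dim); pooled conditioning is broadcast over T
        cond = torch.cat([text_feat, speech_feat, emotion_embed,
                          seed_motion.flatten(1)], dim=-1)
        h = (self.in_proj(noisy_gesture)
             + self.cond_proj(cond).unsqueeze(1)
             + self.time_embed(t.float().view(-1, 1)).unsqueeze(1))
        return self.out_proj(self.backbone(h))  # predicted clean gesture (B, T, gesture_dim)
```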
The first choice is fast text transcription used as semantic conditioning alongside the original speech. Linguistic studies (Kipp 2005; Neff et al. 2008; Webb 1997) show that gestures in everyday communication can be decomposed into a limited number of semantic units, each realized with many motion variations. DeepGesture splits speech features into two types corresponding to this decomposition: high-level features representing semantic units and low-level features capturing motion variations. The relationships between the two feature types are learned across different layers of the speech encoder, which effectively disentangles them: different layers come to carry different feature types.
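A minimal sketch of that layer-wise readout, assuming a generic transformer speech encoder: shallow layers are tapped as low-level (motion-variation) features and deep layers as high-level (semantic) features. The layer indices and dimensions are illustrative, not the paper's configuration.

```python
import torch.nn as nn

class LayeredSpeechEncoder(nn.Module):
    """Sketch of layer-wise feature decomposition: read shallow layers as low-level
    (motion-variation) features and deep layers as high-level (semantic) features.
    Layer choices are illustrative assumptions."""

    def __init__(self, n_mels=80, hidden=512, num_layers=6, low_layer=1, high_layer=5):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, hidden)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
            for _ in range(num_layers)])
        self.low_layer, self.high_layer = low_layer, high_layer

    def forward(self, mel):           # mel: (B, T, n_mels) log-mel frames
        h = self.input_proj(mel)
        taps = []
        for layer in self.layers:
            h = layer(h)
            taps.append(h)
        low = taps[self.low_layer]    # low-level: prosody / motion variation
        high = taps[self.high_layer]  # high-level: semantic content
        return low, high
```

The two streams can then condition the denoiser separately, so the model can vary how a gesture is performed without changing which gesture the semantics call for.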
The second choice is emotion-guided classifier-free diffusion, which supports controllable gesture generation across affective states. Each gesture sequence is labeled with an emotion, and the system supports interpolation between emotional states, giving a continuous control surface over affect rather than a discrete category selection. The system also handles out-of-distribution speech, including synthetic voices, suggesting the feature decomposition extends beyond the training distribution.
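At sampling time, the interpolation and guidance might look like the following classifier-free guidance sketch; the `drop_cond` flag, the conditioning dict, and the guidance weight are assumed interfaces, not DeepGesture's actual API.

```python
import torch

def guided_denoise(model, noisy, t, cond, emotion_a, emotion_b, alpha=0.5, guidance=3.0):
    """Classifier-free guidance with an interpolated emotion embedding.
    `model` is assumed to expose a `drop_cond` flag (trained by randomly dropping
    conditioning) so an unconditional prediction is available at inference."""
    # Continuous control over affect: blend two learned emotion embeddings.
    emotion = torch.lerp(emotion_a, emotion_b, alpha)  # (1 - alpha) * a + alpha * b

    cond_out = model(noisy, t, {**cond, "emotion": emotion}, drop_cond=False)
    uncond_out = model(noisy, t, cond, drop_cond=True)

    # Standard CFG combination: extrapolate from the unconditional prediction
    # toward the emotion-conditioned one.
    return uncond_out + guidance * (cond_out - uncond_out)
```

Sweeping `alpha` from 0 to 1 moves the output along a continuous path between the two affective states rather than switching between discrete labels.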
The structural insight is that multimodal conditioning works best when the modalities are decomposed at the right granularity. Treating speech as a single feature stream conflates semantic content (which determines what gesture to make) with motion variation (which determines how to make it). Treating gesture as the regression target without distinguishing the two leads to gestures that are semantically appropriate but stylistically flat, or stylistically rich but contextually wrong. The disentanglement is the architectural mechanism that produces contextual appropriateness and emotional expressiveness simultaneously, analogous to how "Can diffusion models enable control that autoregressive models cannot reach?" uses diffusion's continuous latents to make global attributes separately controllable.
Source: Diffusion LLM
Related concepts in this collection
- Can diffusion models enable control that autoregressive models cannot reach?
  Autoregressive language models struggle with complex global controls like syntax and infilling because they generate left-to-right and have discrete token bottlenecks. Can diffusion models' continuous latents and parallel denoising overcome these structural limitations?
  extends: the same diffusion-control story applied to multimodal output; disentangled latent layers enable attribute control without retraining
- Should emotion AI estimate intensity instead of assigning labels?
  Explores whether emotion AI systems should measure continuous intensity across multiple emotions rather than forcing single-label classification. This matters because the theoretical foundation (how emotions actually work) may determine which approach is more accurate.
  tension: DeepGesture treats emotion as a labeled-and-interpolatable axis; constructed-emotion theory says affect is contextually assembled rather than universal, so interpolation may smooth over what should be re-estimated per context
- What makes linguistic agency impossible for language models?
  From an enactive perspective, does linguistic agency require embodied participation and real stakes that LLMs fundamentally lack? This matters because it challenges whether LLMs can truly engage in language or only generate text.
  tension: gesture synthesis is a surface analogue of embodiment but not the participatory embodiment enactivism requires; produced gesture is not lived gesture
- Can AI systems read cognitive state from interaction patterns alone?
  Explores whether behavioral telemetry (gaze, typing hesitation, interaction speed) can serve as a reliable continuous signal of user cognitive state without explicit self-report, and what design constraints this imposes.
  complements: an input-side parallel; DeepGesture generates gesture from speech, while multimodal cues read state from behavior
Original note title: co-speech gesture decomposes into semantic units and motion variations that disentangle across speech encoder layers — enabling controllable emotional and contextual gesture generation