Conversational AI Systems · Psychology and Social Cognition · Language Understanding and Pragmatics

Can speech features be separated into semantic and stylistic components?

Linguistic theory suggests gestures decompose into semantic units and motion variations. Does this decomposition actually emerge in speech encoder layers, and can it enable more expressive gesture synthesis?

Note · 2026-05-03 · sourced from Diffusion LLM

The current bottleneck in creating digital humans is generating character movements that correspond naturally to text or speech input. Gesture synthesis has been approached as classification, clustering, or regression. DeepGesture casts it as regression-based prediction conditioned on multimodal signals (text, speech, emotion, and seed motion) and uses a diffusion architecture, built on DiffuseStyleGesture, with two architectural choices that improve semantic alignment and emotional expressiveness.
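The diffusion side of this formulation can be sketched as a single scalar reverse step. This is generic DDPM-style sampling, not DeepGesture's exact sampler: `eps_hat` stands in for the noise that the conditioned network would predict for one pose coordinate, and the beta schedule is a toy.

```python
import math
import random

def ddpm_reverse_step(x_t, eps_hat, t, betas, rng=None):
    """One DDPM reverse step on a scalar pose coordinate:
    compute the posterior mean from the predicted noise eps_hat,
    then add variance-scaled Gaussian noise for t > 0."""
    rng = rng or random.Random(0)
    alpha_t = 1.0 - betas[t]
    # Cumulative product of alphas up to step t.
    alpha_bar_t = math.prod(1.0 - b for b in betas[: t + 1])
    mean = (x_t - betas[t] / math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_t)
    if t == 0:
        # Final step is deterministic: no noise added.
        return mean
    return mean + math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)
```

In the full model the same step runs jointly over every joint rotation in the pose sequence, with `eps_hat` produced by the transformer conditioned on text, speech, emotion, and seed motion.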

The first choice is fast text transcription as semantic conditioning, paired with the original speech. Linguistic studies (Kipp 2005, Neff et al. 2008, Webb 1997) show that gestures in daily communication can be decomposed into a limited number of semantic units with various motion variations. DeepGesture splits speech features into two types corresponding to this decomposition: high-level features representing semantic units and low-level features capturing motion variations. The relationship between the two feature types is learned at different layers of the speech encoder, which effectively disentangles them: different layers carry different feature types.
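The layer-wise split can be sketched as pooling different encoder layers into two streams. The layer indices below are illustrative; the note says the assignment of feature types to layers is learned, not fixed by hand.

```python
def split_speech_features(layer_outputs, semantic_layers, style_layers):
    """Average selected speech-encoder layers into a semantic stream
    (what gesture to make) and a style stream (how to make it).
    layer_outputs: list of per-layer feature vectors (lists of floats)."""
    def pool(indices):
        dim = len(layer_outputs[0])
        return [sum(layer_outputs[i][d] for i in indices) / len(indices)
                for d in range(dim)]
    return pool(semantic_layers), pool(style_layers)

# Toy 3-layer encoder output with 2-dim features per layer:
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
semantic, style = split_speech_features(layers, semantic_layers=[1, 2],
                                        style_layers=[0])
```

The point of the sketch is only the routing: downstream, the semantic stream conditions gesture selection while the style stream conditions its execution.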

The second choice is emotion-guided classifier-free diffusion, which supports controllable gesture generation across affective states. Each gesture sequence is labeled with an emotion, and the system supports interpolation between emotional states, giving a continuous control surface over affect rather than a discrete category selection. The system also generalizes to out-of-distribution speech, including synthetic voices, suggesting the feature decomposition holds beyond the training distribution.
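Classifier-free guidance with emotion interpolation reduces to two short formulas. This is a toy sketch over Python lists; the function names and the guidance weight `w` are illustrative, not DeepGesture's actual API.

```python
def interpolate_emotion(e_a, e_b, alpha):
    """Blend two emotion embeddings; alpha=1.0 gives pure e_a,
    alpha=0.0 gives pure e_b. This is the continuous control surface."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(e_a, e_b)]

def cfg_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance: push the noise prediction away from
    the unconditional estimate toward the emotion-conditioned one.
    eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

During training the emotion condition is randomly dropped so the same network learns both `eps_cond` and `eps_uncond`; at sampling time `w` trades emotional expressiveness against fidelity.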

The structural insight is that multimodal conditioning works best when the modalities are decomposed at the right granularity. Treating speech as a single feature stream conflates semantic content (which determines what gesture to make) with motion variation (which determines how to make it). Treating gesture as the regression target without distinguishing the two yields gestures that are semantically appropriate but stylistically flat, or stylistically rich but contextually wrong. The disentanglement is the architectural mechanism that produces contextual appropriateness and emotional expressiveness simultaneously, analogous to how "Can diffusion models enable control that autoregressive models cannot reach?" uses diffusion's continuous latents to make global attributes separately controllable.




co-speech gesture decomposes into semantic units and motion variations that disentangle across speech encoder layers — enabling controllable emotional and contextual gesture generation