Why do handcrafted acoustic features outperform neural speaker embeddings for personality?

This explores why simple, hand-measured sound properties (pitch, loudness, timing) beat learned neural speaker representations when predicting personality from voice — and what that says about where personality actually lives in speech.

The corpus suggests the answer is about *what* each method captures, not how powerful it is. The clearest evidence is the finding that handcrafted acoustic features outperform neural embeddings for personality, together with the finding that the same acoustic cue flips meaning by context: a feature that signals extraversion in a neutral interview predicts neuroticism under stress (see "Does personality sound the same in stressful and neutral conversations?"). The takeaway is that personality isn't a fixed 'sound' a speaker carries around; it is conveyed through specific, measurable behaviors that shift with the situation. Handcrafted features isolate exactly those behaviors. Neural speaker embeddings are trained to capture holistic speaker *identity* (who is talking), which is the wrong target: they're optimized to be stable across contexts, so they smooth away the very contextual variation that personality lives in.
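To make "handcrafted features" concrete, here is a minimal sketch of the three families named above (pitch, loudness, timing), computed frame by frame with only the Python standard library. Everything in it is an illustrative assumption rather than the pipeline used in the cited work: the function names, the 40 ms frame size, the silence threshold, and the autocorrelation pitch estimator are all hypothetical choices, and real studies typically rely on established feature sets and toolkits.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def rms(frame):
    """Root-mean-square energy, a rough proxy for loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def pitch_autocorr(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate F0 in Hz from the autocorrelation peak in a plausible lag range."""
    lo, hi = int(sr / fmax), int(sr / fmin)
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, min(hi, len(frame) - 1) + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sr / best_lag

def handcrafted_features(samples, sr, frame_ms=40, hop_ms=20, silence_rms=0.01):
    """Summarize a recording as pitch, loudness, and timing statistics.

    silence_rms is an assumed threshold: frames below it count as pauses.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = frame_signal(samples, frame_len, hop)
    energies = [rms(f) for f in frames]
    active = [f for f, e in zip(frames, energies) if e >= silence_rms]
    pitches = [pitch_autocorr(f, sr) for f in active]
    return {
        "mean_pitch_hz": sum(pitches) / len(pitches) if pitches else 0.0,
        "mean_rms": sum(energies) / len(energies),
        "pause_ratio": 1.0 - len(active) / len(frames),
    }

# Toy input: half a second of a 200 Hz tone followed by half a second of silence.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 2)]
print(handcrafted_features(tone + [0.0] * (sr // 2), sr))
```

Each output value is an interpretable measurement of one behavior. That property is what lets the same feature be re-read differently in a neutral versus a stressful context, whereas a fixed speaker embedding offers no such handle.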

There's a deeper architectural reason hiding in the speech-model corpus. Self-supervised speech embeddings learn the language-agnostic *physics* of how a vocal tract produces sound: the causal articulatory processes behind the acoustics, not the social signals layered on top (see "Do speech models learn language-specific sounds or universal physics?"). That's why they transfer so well across languages: they've abstracted toward the invariant, speaker-independent mechanism. Personality perception needs the opposite, namely the moment-to-moment behavioral grain that handcrafted features measure directly. A representation that's excellent at 'how speech is physically generated' can be poor at 'how this person is behaving right now,' because it has learned to discard exactly that variation as noise.

This rhymes with a pattern that runs across the whole collection: traits tend to be carried by specific, locatable signals rather than diffuse holistic style. In language models, personality turns out to be steerable through narrow linear directions in activation space ("Can we track and steer personality shifts during model finetuning?"), a low-dimensional 'distance from the default Assistant' axis ("How stable is the trained Assistant personality in language models?"), and tiny per-layer adapters that add under 0.1% of parameters ("Can we control personality in language models without prompting?"). Even hidden trait transmission between models works through specific statistical signatures rather than holistic semantic content ("Can language models transmit hidden behavioral traits through unrelated data?"). The recurring lesson: personality is sparse and specific, so methods that target concrete features win over methods that learn one big blended representation.
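As a rough illustration of what a "narrow linear direction in activation space" means in practice, the sketch below treats a hidden state as a plain list of floats and a trait as a single direction vector. The three helpers (`project`, `steer`, `cap`) are hypothetical names, and reading "activation capping" as clipping the projection onto the direction is an assumption made for illustration, not the method from the cited notes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def project(h, direction):
    """Scalar amount of the trait direction present in hidden state h."""
    return dot(h, unit(direction))

def steer(h, direction, alpha):
    """Shift h by alpha units along the trait direction."""
    d = unit(direction)
    return [x + alpha * di for x, di in zip(h, d)]

def cap(h, direction, max_proj):
    """Remove only the part of h that exceeds max_proj along the direction."""
    d = unit(direction)
    excess = dot(h, d) - max_proj
    if excess <= 0:
        return list(h)
    return [x - excess * di for x, di in zip(h, d)]

# Toy 3-dimensional hidden state and trait direction (both made up).
h = [1.0, 2.0, 3.0]
trait = [0.0, 0.0, 2.0]
print(project(h, trait), steer(h, trait, 2.0), cap(h, trait, 1.0))
```

The point of the toy is that all three operations touch one coordinate of the representation and leave everything orthogonal to the trait direction unchanged, which is the sense in which the trait is "sparse and specific".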

The irony is that handcrafted features usually lose their edge over neural ones as datasets grow. Here the advantage persists because the handcrafted features encode a correct prior (personality = specific behaviors) that the embedding objective actively works against. If you want a doorway into this, the situational-variation finding is where to start ("Does personality sound the same in stressful and neutral conversations?"), and the articulatory-physics result explains *why* the powerful representation is aimed at the wrong thing ("Do speech models learn language-specific sounds or universal physics?").

Sources (6 notes)

Does personality sound the same in stressful and neutral conversations?

Acoustic features that signal extraversion in neutral interviews instead predict neuroticism under stress. Handcrafted acoustic features outperform neural embeddings, suggesting personality is conveyed through specific measurable behaviors rather than holistic speaker style.

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.
