DO THEY SEE WHAT WE SEE?

Introducing EMONET-FACE & EMONET-VOICE: A New Foundation

To address these challenges, we developed the EMONET suites. At their core is a novel 40-category emotion taxonomy, meticulously derived from an extensive analysis of the "Handbook of Emotions" and refined through consultation with psychologists. This taxonomy moves far beyond basic emotions, encompassing a rich spectrum of positive and negative affective states, cognitive states (e.g., Concentration, Confusion, Doubt), physical states (e.g., Pain, Fatigue, Intoxication), and socially mediated emotions (e.g., Embarrassment, Shame, Pride, Teasing). This granularity is crucial for building AI that can appreciate the finer details of human emotional life.
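To make the structure concrete, the taxonomy can be thought of as a grouped list of category names. The snippet below is only a partial, illustrative sketch built from the examples named above; the full 40-category list and its exact grouping are defined in the papers.

```python
# Partial, illustrative sketch of the 40-category taxonomy as a grouped structure.
# Only the example categories mentioned above are listed; the complete taxonomy
# and its exact grouping come from the EMONET papers.
EMOTION_TAXONOMY = {
    "cognitive_states": ["Concentration", "Confusion", "Doubt"],
    "physical_states": ["Pain", "Fatigue", "Intoxication"],
    "socially_mediated": ["Embarrassment", "Shame", "Pride", "Teasing"],
    # ... plus the remaining positive and negative affective states,
    # for a total of 40 categories.
}

ALL_CATEGORIES = [c for group in EMOTION_TAXONOMY.values() for c in group]
```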

EMONET-FACE

EMONET-FACE provides a rich resource for visual emotion understanding:

EMONET-FACE BIG (over 203,000 synthetic images) offers a vast dataset for pre-training models.

EMONET-FACE BINARY (approx. 20,000 images) is designed for fine-tuning and features over 62,000 binary (present/absent) emotion annotations from human experts. These annotations underwent a rigorous multi-stage process, requiring triple positive agreement for affirmative labels and a contrastive batch to ensure high-quality true negatives.
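To make the agreement rule concrete, here is a minimal sketch of how binary annotations could be aggregated under a triple-positive-agreement criterion. The field names and helper function are illustrative, not the actual EMONET-FACE BINARY schema or tooling.

```python
from collections import defaultdict

def aggregate_binary_labels(annotations, required_positives=3):
    """Keep an (image, emotion) pair as a positive label only if at least
    `required_positives` annotators independently marked it as present.

    `annotations` is an iterable of dicts with illustrative keys
    {"image_id", "emotion", "present"}; the real dataset schema may differ.
    """
    votes = defaultdict(int)
    for ann in annotations:
        if ann["present"]:
            votes[(ann["image_id"], ann["emotion"])] += 1
    return {key for key, count in votes.items() if count >= required_positives}

# Example: three independent positive votes are needed for "Pride" to count.
anns = [
    {"image_id": "img_001", "emotion": "Pride", "present": True},
    {"image_id": "img_001", "emotion": "Pride", "present": True},
    {"image_id": "img_001", "emotion": "Pride", "present": True},
    {"image_id": "img_001", "emotion": "Doubt", "present": True},
]
print(aggregate_binary_labels(anns))  # {("img_001", "Pride")}
```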

EMONET-FACE HQ (2,500 images) serves as our gold-standard evaluation benchmark. Each image was meticulously rated by multiple psychology experts on a continuous 0-7 intensity scale across all 40 emotion categories, resulting in 10,000 expert annotations.
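Because EMONET-FACE HQ provides continuous intensity ratings rather than single labels, evaluation naturally takes a regression-style form. A minimal sketch, assuming model predictions and per-image averaged expert ratings are available as NumPy arrays, is a per-emotion correlation between the two (the paper's actual evaluation protocol may differ):

```python
import numpy as np

def per_emotion_correlation(pred, expert_mean):
    """Pearson correlation between predicted and expert intensities.

    pred, expert_mean: arrays of shape (num_images, 40), where expert_mean
    holds each image's 0-7 ratings averaged over the psychology experts.
    Returns one correlation per emotion category.
    """
    correlations = []
    for k in range(pred.shape[1]):
        correlations.append(np.corrcoef(pred[:, k], expert_mean[:, k])[0, 1])
    return np.array(correlations)

# Toy example with random data standing in for real predictions and ratings.
rng = np.random.default_rng(0)
pred = rng.uniform(0, 7, size=(2500, 40))
expert = rng.uniform(0, 7, size=(2500, 40))
print(per_emotion_correlation(pred, expert).mean())
```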

The synthetic images were generated using state-of-the-art text-to-image models with explicit prompts to ensure diverse demographic representation (across ethnicity, age, and gender) and clear, full-face expressions. This approach not only allows for controlled diversity but also sidesteps the ethical concerns associated with using real individuals' images.
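As a rough sketch of how such controlled diversity can be achieved, prompts can be assembled by crossing demographic attributes with an emotion description before being sent to the text-to-image model. The attribute pools and template below are purely illustrative, not the exact prompts used to build the datasets:

```python
import itertools
import random

# Illustrative attribute pools; the real generation pipeline used its own
# curated lists and a state-of-the-art text-to-image model.
ETHNICITIES = ["East Asian", "Black", "White", "South Asian", "Latino", "Middle Eastern"]
AGES = ["young adult", "middle-aged", "elderly"]
GENDERS = ["woman", "man"]

def build_prompt(emotion, ethnicity, age, gender):
    return (
        f"A clear, well-lit, full-face portrait of a {age} {ethnicity} {gender} "
        f"showing an expression of {emotion.lower()}, looking at the camera"
    )

# Cross the demographic attributes so every group is represented for an emotion.
prompts = [
    build_prompt("Confusion", e, a, g)
    for e, a, g in itertools.product(ETHNICITIES, AGES, GENDERS)
]
print(random.choice(prompts))
```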

Why Emotion in Speech and Face Matters: The Vision of Universal Voice Actors

Effective communication transcends mere words. It's woven with the rich threads of emotion, conveyed through the subtle shifts in our facial expressions and the intricate nuances of our voice. Capturing these expressions enables AI assistants to become more empathetic, engaging, and supportive: qualities that are crucial for transformative applications in education, mental health, companionship, and beyond.

We envision a future where multimodal foundation models evolve into "omni-models" with sophisticated audio-in/audio-out capabilities. Soon, every new foundation model released on platforms like Hugging Face could be capable of voice acting on par with Robert De Niro or Scarlett Johansson. These AI systems will function like world-class voice actors, capable of being prompted not just by text but also by voice to adopt any persona. Imagine an AI that can embody an empathetic educator adapting to a student's confusion, a thrilling storyteller captivating an audience, or a knowledgeable research assistant explaining complex concepts with clarity and appropriate gravitas. This level of seamless and inspiring human-AI interaction is our ultimate goal.

The Imperative for Better Benchmarks: Seeing and Hearing the Nuance

The journey to emotionally intelligent AI begins with data. Existing datasets for emotion recognition, while valuable, often present significant limitations. Facial emotion datasets might rely on a narrow range of "basic" emotions, use images with occlusions or poor lighting, or lack demographic diversity, leading to biased models that perform poorly across different populations. Similarly, speech emotion datasets can be constrained by coarse emotion taxonomies, privacy concerns tied to real user data, or an over-reliance on acted portrayals that don't capture the subtlety of spontaneous emotional expression.

The Theory of Constructed Emotion (TCE), a prominent psychological framework, posits that emotions are not universal, pre-programmed entities that we simply "recognize." Instead, they are constructed by our brains based on a combination of interoceptive signals (like valence – pleasantness/unpleasantness, and arousal – activation/deactivation), learned concepts, and contextual information. This means there isn't a single, definitive facial expression or vocal intonation for "joy" or "sadness" that is universally and unambiguously displayed. Rather, emotional expression is a complex, dynamic, and often ambiguous signal.

This understanding underscores the need for emotion estimation rather than simple recognition. We need AI that can assess the likelihood and intensity of various emotions being present, rather than forcing a single label onto a complex human state.
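In modeling terms, the difference shows up as independent per-emotion scores instead of one mutually exclusive softmax label. Here is a minimal PyTorch-style sketch of such an estimation head; the encoder, embedding size, and layer shapes are assumptions for illustration, not the architecture of our models:

```python
import torch
import torch.nn as nn

class EmotionEstimationHead(nn.Module):
    """Estimates an independent intensity score per emotion category,
    instead of forcing one exclusive label via a softmax classifier.
    The embedding size is a placeholder for any face or speech backbone."""

    def __init__(self, embed_dim=768, num_emotions=40):
        super().__init__()
        self.proj = nn.Linear(embed_dim, num_emotions)

    def forward(self, embeddings):
        # Sigmoid keeps each emotion's score in [0, 1] independently,
        # so several emotions can be present (or absent) at once.
        return torch.sigmoid(self.proj(embeddings))

head = EmotionEstimationHead()
fake_embeddings = torch.randn(4, 768)   # stand-in for encoder outputs
scores = head(fake_embeddings)          # shape (4, 40), one score per emotion
print(scores.shape)
```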

The Future: Reasoning About Emotions, and the Dawn of Universal Voice Actors

The ability to accurately estimate emotions is a critical first step. The next frontier is to enable AI systems to reason about these emotions in context. We are convinced that in the very near future, foundation models will be multimodal, natively accepting and producing audio as well as text. These will be the "universal voice actors" we envision, capable of understanding, embodying, and expressing a vast range of human personas and emotions.

Imagine prompting an AI: "Speak like a caring nurse comforting a worried patient," or "Tell this story as a slightly grumpy but lovable grandpa." LAION's Got Talent and EMONET-VOICE are paving the way for such capabilities. Furthermore, the rich, multi-label, intensity-aware annotations in our EMONET suites provide the kind of data needed for training advanced reasoning models (like OpenAI's O-family or DeepSeek's R1) to understand the implications of emotional states and to predict likely human actions or outcomes from observed cues and mental models, moving beyond simple recognition to true comprehension.

To truly democratize this field, LAION, with Intel's support, is committed to annotating millions of permissively licensed audio samples using our EMPATHIC INSIGHT-VOICE model. This will create an unparalleled public resource, fueling further research and development in self-supervised and multi-modal emotion learning.

Looking ahead, our next ambitious goal is to create a massive, permissively licensed multilingual speech dataset exceeding 500,000 hours. This monumental undertaking is powered by the Intel® Tiber AI Cloud, where we are leveraging its high-performance, 192-core CPU instances to process and curate this unparalleled resource. This will further democratize and accelerate research, paving the way for the next generation of emotionally aware AI.
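Much of this curation is embarrassingly parallel, which is where high-core-count CPU instances pay off. The following is a minimal sketch using Python's standard multiprocessing; the file pattern, per-clip step, and worker count are placeholders, not our actual pipeline:

```python
import glob
from multiprocessing import Pool
import wave

def inspect_clip(path):
    """Placeholder per-file step: read basic metadata from a WAV file.
    The real curation pipeline applies its own filtering and annotation."""
    with wave.open(path, "rb") as f:
        duration_seconds = f.getnframes() / f.getframerate()
    return path, duration_seconds

if __name__ == "__main__":
    # Stand-in file pattern for the audio corpus being curated.
    audio_files = sorted(glob.glob("clips/*.wav"))
    # One worker per core, e.g. 192 on the CPU instances mentioned above.
    with Pool(processes=192) as pool:
        for path, duration in pool.imap_unordered(inspect_clip, audio_files):
            print(path, duration)
```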