Explainable Multimodal Emotion Reasoning

Paper · arXiv 2306.15401 · Published June 27, 2023

“Multimodal emotion recognition has experienced rapid development in recent years [1, 2]. Current works mainly focus on collecting larger and more realistic datasets [3, 4], or building more effective architectures [5, 6]. Despite promising results, multimodal emotion recognition suffers from label ambiguity. The reason for label ambiguity lies in the subjectivity of emotion itself, i.e., different annotators may assign distinct emotions to the same video. Due to label ambiguity, the emotion labels of existing benchmark datasets may be inaccurate. Therefore, it is difficult for systems developed on these datasets to meet the high reliability requirements of practical applications.

To alleviate label ambiguity, previous works mainly employ more annotators and use majority voting to select the most relevant emotion label, or several of the most relevant labels [7, 8]. Although majority voting improves annotation reliability, this process removes correct but non-dominant labels, limiting the ability to describe subtle emotions. Meanwhile, since the annotators of different datasets are usually non-overlapping, label ambiguity is more serious in cross-corpus settings, with the result that combining multiple small-scale datasets generally does not improve performance.”
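The majority-voting problem described above can be illustrated with a minimal sketch. The function name and the annotation values are hypothetical, not from the paper; the point is that any plausible but minority label (here, "surprise") is discarded once votes are aggregated:

```python
from collections import Counter

def majority_vote(annotations):
    """Return the single most frequent label among annotator votes.

    Ties are broken arbitrarily by Counter's internal ordering, which
    is itself a source of label noise in real annotation pipelines.
    """
    counts = Counter(annotations)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical annotations for one video clip: three annotators say
# "happy", one says "surprise" -- a correct but non-dominant emotion.
votes = ["happy", "happy", "surprise", "happy"]
print(majority_vote(votes))  # -> "happy"; "surprise" is lost
```

Keeping the full vote distribution (e.g., as soft labels) rather than collapsing it to a single winner is one common way around this loss, at the cost of a more complex training objective.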