The research presents AMuSE (Adaptive Multimodal Analysis for Speaker Emotion), a model for recognizing individual speakers' emotions in group conversations, a capability that is key to building intelligent agents for natural human-machine interaction. The model uses a Multimodal Attention Network to capture cross-modal interactions at different levels of spatial abstraction, and an Adaptive Fusion technique to combine the resulting mode-specific descriptors. It condenses spatial and temporal features into two dense descriptors: one at the speaker level and one at the utterance level. The model showed improved classification performance on large-scale public datasets.
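To make the described pipeline concrete, below is a minimal PyTorch sketch of the general idea: per-modality encoders, cross-modal attention, and a gated adaptive fusion that produces a single dense utterance-level descriptor. All class names, feature dimensions, and the gating scheme are illustrative assumptions for exposition, not the authors' implementation (see the paper linked below for the actual architecture).

```python
# Hypothetical sketch of an AMuSE-style pipeline; names and dimensions are
# illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality's features attend over another's (cross-modal interaction)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feats, context_feats):
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return fused


class AdaptiveFusion(nn.Module):
    """Combines mode-specific descriptors with learned, input-dependent weights."""

    def __init__(self, dim: int, num_modes: int = 3):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim * num_modes, num_modes),
                                  nn.Softmax(dim=-1))

    def forward(self, descriptors):  # list of (batch, dim) tensors
        stacked = torch.stack(descriptors, dim=1)              # (batch, modes, dim)
        weights = self.gate(stacked.flatten(1)).unsqueeze(-1)  # (batch, modes, 1)
        return (weights * stacked).sum(dim=1)                  # (batch, dim)


class EmotionClassifier(nn.Module):
    def __init__(self, dim: int = 128, num_emotions: int = 6):
        super().__init__()
        # One encoder per modality (audio, video, text) projecting to a shared dim.
        self.audio_enc = nn.Linear(40, dim)
        self.video_enc = nn.Linear(512, dim)
        self.text_enc = nn.Linear(768, dim)
        self.cross_attn = CrossModalAttention(dim)
        self.fusion = AdaptiveFusion(dim)
        self.head = nn.Linear(dim, num_emotions)

    def forward(self, audio, video, text):
        a, v, t = self.audio_enc(audio), self.video_enc(video), self.text_enc(text)
        # Cross-modal interactions: each modality attends over the other two.
        a2 = self.cross_attn(a, torch.cat([v, t], dim=1))
        v2 = self.cross_attn(v, torch.cat([a, t], dim=1))
        t2 = self.cross_attn(t, torch.cat([a, v], dim=1))
        # Collapse the temporal axis into one dense descriptor per modality,
        # then adaptively fuse them into an utterance-level descriptor.
        descriptors = [x.mean(dim=1) for x in (a2, v2, t2)]
        utterance_descriptor = self.fusion(descriptors)
        return self.head(utterance_descriptor)


if __name__ == "__main__":
    model = EmotionClassifier()
    audio = torch.randn(2, 50, 40)   # (batch, frames, acoustic features)
    video = torch.randn(2, 50, 512)  # (batch, frames, visual features)
    text = torch.randn(2, 20, 768)   # (batch, tokens, token embeddings)
    print(model(audio, video, text).shape)  # torch.Size([2, 6])
```

The sketch fuses everything into a single utterance-level descriptor for brevity; the paper additionally maintains a speaker-level descriptor alongside it.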
Publication date: 31 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.15164