The article focuses on Speech Emotion Recognition (SER), an essential tool for enhancing human-computer interaction by understanding users' emotional states. The authors propose a novel approach that combines self-supervised feature extraction using the Wav2Vec model with supervised classification to recognize emotions from small audio segments. The findings suggest that the proposed method outperforms two baselines: a support vector machine (SVM) classifier and transfer learning from a pre-trained Convolutional Neural Network (CNN). The study thus highlights the value of deep self-supervised feature learning for improving SER and enhancing emotional comprehension in human-computer interaction.
Publication date: 25 Sep 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2309.12714
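
A minimal sketch of the two-stage pipeline described above: frozen self-supervised embeddings are extracted from short audio segments, then a lightweight supervised classifier is trained on top. The sketch assumes Wav2Vec 2.0 via Hugging Face Transformers with mean-pooled hidden states and a scikit-learn logistic-regression head; the paper's exact Wav2Vec variant, segment length, pooling strategy, and classifier are not given here, so these choices (and the toy data) are illustrative placeholders, not the authors' implementation.

```python
# Sketch only: Wav2Vec 2.0 + logistic regression stand in for the paper's
# unspecified feature extractor and supervised classifier.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()  # frozen feature extractor; no fine-tuning in this sketch

def embed(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pool the frozen Wav2Vec 2.0 hidden states of one audio segment."""
    inputs = extractor(segment, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()    # (768,)

# Hypothetical data: eight 1 s segments at 16 kHz with toy emotion labels.
segments = [np.random.randn(16000).astype(np.float32) for _ in range(8)]
labels = [0, 1, 0, 1, 2, 2, 0, 1]  # e.g. neutral / happy / angry

X = np.stack([embed(s) for s in segments])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:2]))
```

Keeping the Wav2Vec backbone frozen and training only the classifier mirrors the split the summary draws between self-supervised feature extraction and supervised emotion classification, and it keeps the supervised stage cheap enough to train on small segment-level datasets.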