This paper proposes a frame-level emotional state alignment method for speech emotion recognition (SER). The authors first fine-tune HuBERT with task-adaptive pretraining (TAPT) to obtain an SER system, then extract embeddings from that model's transformer layers to form frame-level pseudo-emotion labels. These pseudo labels serve as targets for a further round of HuBERT pretraining, so that each frame-level output carries emotional information. Finally, an attention layer is added on top of the resulting model, which is fine-tuned for SER. On IEMOCAP, the authors report that the method outperforms prior state-of-the-art approaches.
Publication date: 27 Dec 2023
Project Page: https://github.com/ASolitaryMan/HFLEA.git
Paper: https://arxiv.org/pdf/2312.16383
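
The final step of the summary, an attention layer that turns frame-level outputs into a single utterance-level representation, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `attention_weights` and `attention_pool`, the dot-product scoring, and the fixed `query` vector are all assumptions made for illustration. In a real model the query would be a learned parameter and the frames would be HuBERT outputs.

```python
import math
from typing import List, Sequence


def attention_weights(frames: Sequence[Sequence[float]],
                      query: Sequence[float]) -> List[float]:
    """Return one softmax weight per frame (hypothetical helper).

    Each frame is scored by its dot product with `query`, which stands in
    for a learned parameter (an assumption, not taken from the paper).
    """
    if not frames:
        raise ValueError("need at least one frame")
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Softmax over the time axis, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Collapse frame-level vectors into one utterance-level vector."""
    weights = attention_weights(frames, query)
    dim = len(frames[0])
    # Weighted sum of the frames, computed one dimension at a time.
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]


if __name__ == "__main__":
    # Three toy "frames" of dimension 2; real frame outputs would be much larger.
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(attention_weights(toy_frames, [10.0, 0.0]))
    print(attention_pool(toy_frames, [10.0, 0.0]))
```

With an all-zero query every frame receives the same weight, so the pooled vector reduces to the plain mean over frames. A non-zero query shifts weight toward the frames it scores highly, which is how frames carrying more emotional information could come to dominate the utterance-level vector.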