This paper examines the limitations of current masked audio modeling (MAM) methods and proposes a way to strengthen their semantic modeling. The method distills cross-modal knowledge from contrastive language-audio pretraining (CLAP) representations and adds a supervised classification branch trained under a multi-objective learning strategy. The authors report significant improvements on multiple downstream tasks and new state-of-the-art results on several audio and speech classification tasks.
Publication date: 31 Jan 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2401.15953
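
The multi-objective strategy described in the summary can be pictured as a weighted sum of three terms: a masked reconstruction loss, a distillation loss toward CLAP embeddings, and a supervised classification loss. The summary does not give the loss forms or weights, so everything below is an assumption made for illustration: the function names, the choice of mean squared error, cosine distance, and cross-entropy, and the weights `w_rec`, `w_distill`, and `w_cls` are all hypothetical and are not taken from the paper.

```python
import math

def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (mask[i] is truthy).
    Assumed form of the MAM reconstruction term."""
    total, count = 0.0, 0
    for p_vec, t_vec, m in zip(pred, target, mask):
        if m:
            total += sum((p - t) ** 2 for p, t in zip(p_vec, t_vec))
            count += len(p_vec)
    return total / count if count else 0.0

def cosine_distance(student, teacher):
    """1 - cosine similarity between a student embedding and a teacher
    (for example CLAP) embedding. Assumed form of the distillation term."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm = math.sqrt(sum(s * s for s in student)) * math.sqrt(sum(t * t for t in teacher))
    if norm == 0.0:
        return 1.0  # treat a zero vector as having no similarity
    return 1.0 - dot / norm

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one example.
    Assumed form of the supervised classification term."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def multi_objective_loss(pred, target, mask, student_emb, teacher_emb,
                         logits, label, w_rec=1.0, w_distill=1.0, w_cls=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_rec * masked_mse(pred, target, mask)
            + w_distill * cosine_distance(student_emb, teacher_emb)
            + w_cls * cross_entropy(logits, label))
```

In a real training setup these terms would be computed on batched tensors in a deep learning framework, and the relative weights would need tuning; the sketch only shows how the three objectives might be combined into one scalar.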