January 31, 2024

Masked Audio Modeling with CLAP and Multi-Objective Learning

This paper examines the limitations of current masked audio modeling (MAM) methods and proposes a way to strengthen MAM's semantic modeling. The method distills cross-modal knowledge from contrastive language-audio pretraining (CLAP) representations and adds a supervised classification branch trained under a multi-objective learning strategy. The authors report significant improvements on multiple downstream tasks and new state-of-the-art results on various audio and speech classification tasks.
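
To make the idea of a multi-objective loss concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the choice of cosine distance for the distillation term, the function names, and the `weight` parameter are all assumptions made for illustration. It combines a distillation term (how far the student's predictions at masked frames are from teacher embeddings, standing in for CLAP features) with a supervised cross-entropy term.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distillation_loss(student_frames, teacher_frames, masked):
    """Mean cosine distance over masked frames only (assumed loss form)."""
    idx = [i for i, m in enumerate(masked) if m]
    if not idx:
        raise ValueError("at least one frame must be masked")
    total = sum(cosine_distance(student_frames[i], teacher_frames[i]) for i in idx)
    return total / len(idx)

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def multi_objective_loss(student_frames, teacher_frames, masked,
                         logits, label, weight=1.0):
    """Distillation term plus a weighted classification term.

    `weight` is a hypothetical balancing hyperparameter, not a value
    taken from the paper.
    """
    distill = distillation_loss(student_frames, teacher_frames, masked)
    classify = cross_entropy(logits, label)
    return distill + weight * classify

# Toy usage: two frames, only the first one is masked.
student = [[3.0, 4.0], [1.0, 0.0]]
teacher = [[3.0, 4.0], [0.0, 1.0]]
loss = multi_objective_loss(student, teacher, [True, False],
                            logits=[0.0, 0.0], label=0)
```

In this toy case the student matches the teacher on the only masked frame, so the distillation term vanishes and the total reduces to the classification term. The unmasked frame is ignored even though it disagrees with the teacher, which is the defining property of a masked prediction objective.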

Publication date: 31 Jan 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2401.15953

