The article presents a new approach to Automated Audio Captioning (AAC) that eliminates the need for paired audio-text data. The method trains on text data only, relying on a pre-trained Contrastive Language-Audio Pretraining (CLAP) model, which maps audio and text into a shared embedding space. Because audio and text embeddings still occupy somewhat different regions of that space, the method includes steps to bridge this modality gap. It reaches up to 83% of the performance of fully supervised methods. The approach also simplifies domain adaptation and mitigates the data scarcity problem in AAC.
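The summary does not say how the modality gap is bridged, so the following is only a rough sketch of one strategy that is common in text-only captioning work: perturbing text embeddings with Gaussian noise during training, so that a decoder trained on text embeddings tolerates the offset it will see when given audio embeddings at inference. Everything here is an assumption made for illustration: the random vectors stand in for real CLAP text embeddings, the function names `l2_normalize` and `inject_noise` are hypothetical, and the noise level `0.02` is arbitrary. It should not be read as the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as contrastive embeddings usually are."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def inject_noise(text_emb, noise_std, rng):
    """Add zero-mean Gaussian noise to text embeddings, then re-normalize.

    Assumption: this mimics a training-time perturbation meant to make a
    text-trained decoder robust to the audio/text embedding offset.
    """
    noise = rng.normal(0.0, noise_std, size=text_emb.shape)
    return l2_normalize(text_emb + noise)


rng = np.random.default_rng(0)

# Hypothetical stand-ins for CLAP text embeddings: 4 captions, 512 dimensions.
text_emb = l2_normalize(rng.normal(size=(4, 512)))

# Perturbed embeddings that a caption decoder would be trained on.
noisy_emb = inject_noise(text_emb, noise_std=0.02, rng=rng)

# Cosine similarity between each original embedding and its noisy version.
cos_sim = np.sum(text_emb * noisy_emb, axis=-1)
print(cos_sim)
```

In a real pipeline the random vectors would be replaced by embeddings from a CLAP text encoder at training time and a CLAP audio encoder at inference time, and the noise level would need to be tuned to the size of the gap actually observed.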
Publication date: 25 Sep 2023
Project Page: https://github.com/zelaki/wsac
Paper: https://arxiv.org/pdf/2309.12242