Weakly-supervised Automated Audio Captioning via text only training

The article discusses a weakly-supervised approach to Automated Audio Captioning (AAC) that removes the need for paired audio-text training data. The method requires only text data and a pre-trained Contrastive Language-Audio Pretraining (CLAP) model: a caption decoder is trained on CLAP text embeddings, and at inference time it is conditioned on CLAP audio embeddings instead. To make that swap work, the method bridges the modality gap between the audio and text embeddings. It reaches up to 83% of the performance of fully supervised methods. Because no paired data is needed, the approach simplifies domain adaptation and mitigates the data scarcity problem in AAC.
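As a rough illustration of what bridging the modality gap can look like in practice, the sketch below perturbs a text embedding with Gaussian noise during training, so that a decoder trained on it becomes more tolerant of the offset between text and audio embeddings. This is a minimal sketch under stated assumptions, not necessarily the method used in the paper: noise injection is only one possible strategy, the random vector stands in for a real CLAP text embedding, and the names `inject_noise`, `l2_normalize`, and the value `std=0.05` are hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inject_noise(embedding, std, rng):
    """Add zero-mean Gaussian noise to each dimension, then renormalize.

    Hypothetical helper: `std` controls how far the perturbed embedding
    may drift from the original text embedding.
    """
    noisy = [x + rng.gauss(0.0, std) for x in embedding]
    return l2_normalize(noisy)


rng = random.Random(0)

# Stand-in for a CLAP text embedding (assumption: 512 dimensions, unit norm).
text_embedding = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(512)])

# During text-only training, the decoder would see the noisy version.
noisy_embedding = inject_noise(text_embedding, std=0.05, rng=rng)

# Cosine similarity between the original and the perturbed embedding.
cosine = sum(a * b for a, b in zip(text_embedding, noisy_embedding))
print(f"cosine similarity after noise injection: {cosine:.3f}")
```

In a real pipeline the noisy embedding would be passed to the caption decoder in place of the clean text embedding, and the noise level would be tuned on held-out data.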

Publication date: 25 Sep 2023
Project Page: https://github.com/zelaki/wsac
Paper: https://arxiv.org/pdf/2309.12242

