The paper discusses the limitations of conventional audio classification methods and introduces an approach that incorporates counterfactual analysis. The proposed model draws on acoustic characteristics and sound source information found in human-annotated reference texts, and it uses counterfactual instances to train models to recognize sound events and sources in alternative scenarios. The method was validated by pre-training on multiple audio captioning datasets and evaluated on several common downstream tasks. The results showed a significant improvement in top-1 accuracy on open-ended language-based audio retrieval tasks.
Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.04935
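
To make the summary's two technical ideas concrete, here is a minimal sketch, not taken from the paper. It assumes audio clips and captions are already embedded as plain vectors, and it uses a generic triplet-style margin loss as a stand-in for whatever objective the authors use. The function names (`counterfactual_margin_loss`, `top1_accuracy`) and the `margin` value are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def counterfactual_margin_loss(audio, factual, counterfactual, margin=0.2):
    """Assumed triplet-style objective (illustrative, not the paper's loss).

    The loss is zero once the audio embedding is closer to the factual
    caption than to the counterfactual caption by at least `margin`.
    """
    return max(0.0, margin - cosine(audio, factual) + cosine(audio, counterfactual))


def top1_accuracy(text_queries, audio_bank, targets):
    """Fraction of text queries whose best-scoring clip is the target clip."""
    hits = 0
    for query, target in zip(text_queries, targets):
        scores = [cosine(query, clip) for clip in audio_bank]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += best == target
    return hits / len(targets)


if __name__ == "__main__":
    # Toy 2-D embeddings; real systems would use learned encoders.
    audio = [1.0, 0.0]
    print(counterfactual_margin_loss(audio, [1.0, 0.0], [0.0, 1.0]))
    print(top1_accuracy([[1.0, 0.0], [0.0, 1.0]],
                        [[0.9, 0.1], [0.1, 0.9]],
                        [0, 1]))
```

In this sketch the counterfactual caption acts as a hard negative: the loss stays positive whenever the audio embedding is not clearly closer to the factual caption, which is one plausible way to read "recognizing sound events and sources in alternative scenarios".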