The research focuses on enhancing Automated Audio Captioning (AAC), the task of generating natural-language descriptions of audio clips. State-of-the-art systems are sequence-to-sequence (seq2seq) models, typically Transformers. This study improves on them by leveraging pretrained models and large language models (LLMs): BEATs is used to extract fine-grained audio features, and the Instructor text embedding model supplies caption embeddings as a supervision signal. A novel data augmentation method is also proposed in which ChatGPT blends pairs of ground-truth captions into new ones, increasing the complexity and diversity of the training data. The resulting model achieved a new state-of-the-art SPIDEr-FL score of 32.6 on the Clotho evaluation split and won the DCASE 2023 AAC challenge.
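To make the mix-up augmentation concrete, here is a minimal sketch of how ChatGPT might be prompted to blend two captions into one; the prompt wording, model choice, and sampling settings are assumptions for illustration and are not taken from the paper (it uses the standard openai>=1.0 Python client):

```python
# Sketch of LLM-based caption mix-up augmentation.
# Assumptions: openai>=1.0 client, gpt-3.5-turbo, and a paraphrased
# prompt; the authors' exact prompt and settings may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def mix_captions(caption_a: str, caption_b: str) -> str:
    """Ask ChatGPT to merge two audio captions into a single fluent
    caption that plausibly describes both sound scenes (hypothetical prompt)."""
    prompt = (
        "Combine the two audio captions below into one grammatical "
        "sentence that describes both sound events:\n"
        f"Caption 1: {caption_a}\n"
        f"Caption 2: {caption_b}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature encourages diverse mix-ups
    )
    return resp.choices[0].message.content.strip()

# Example with two hypothetical Clotho-style captions:
print(mix_captions(
    "a dog barks while cars pass by on a wet road",
    "rain falls steadily on a metal roof",
))
```

Generated mix-ups like these can then be paired with mixed audio to enlarge the training set with more complex scenes.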
Publication date: 4 Oct 2023
Project Page: Not Specified
Paper: https://arxiv.org/pdf/2309.17352