The research focuses on Automated Audio Captioning (AAC), the task of generating natural-language descriptions of audio recordings. Current systems are sequence-to-sequence (seq2seq) models, typically Transformers, and this study improves them by leveraging pretrained audio models and large language models (LLMs). Specifically, it uses BEATs to extract fine-grained audio features and the Instructor text embedding model to obtain embeddings of the ground-truth captions, which serve as an auxiliary supervision signal. It also proposes a novel data augmentation method that uses ChatGPT to produce caption mix-ups (fused pairs of training captions), increasing the complexity and diversity of the training data. The resulting model achieved a new state-of-the-art SPIDEr-FL score of 32.6 on the Clotho evaluation split and won the 2023 DCASE AAC challenge.
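To make the text-embedding supervision concrete, the sketch below pulls a pooled decoder representation toward the Instructor embedding of the reference caption via a cosine loss. This is a minimal sketch assuming a PyTorch seq2seq captioner; the names (`embedding_supervision_loss`, `proj`) and the exact pooling are hypothetical, not the paper's code.

```python
import torch
import torch.nn.functional as F

def embedding_supervision_loss(decoder_states: torch.Tensor,
                               caption_embedding: torch.Tensor,
                               proj: torch.nn.Linear) -> torch.Tensor:
    """Cosine-distance loss between a pooled, projected decoder state and
    the (frozen, precomputed) Instructor embedding of the reference caption.

    decoder_states:    (batch, seq_len, d_model) Transformer decoder outputs
    caption_embedding: (batch, d_embed) Instructor embeddings of captions
    proj:              linear layer mapping d_model -> d_embed (hypothetical)
    """
    pooled = decoder_states.mean(dim=1)       # (batch, d_model)
    predicted = proj(pooled)                  # (batch, d_embed)
    return 1.0 - F.cosine_similarity(predicted, caption_embedding, dim=-1).mean()
```

In training, such an auxiliary term would be added to the usual cross-entropy captioning loss with some weight; the pooling and weighting actually used in the paper may differ.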
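The ChatGPT caption mix-up can likewise be sketched as a single chat call that fuses two training captions into one. The prompt wording below is an assumption rather than the paper's exact prompt; the sketch uses the OpenAI Python SDK.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def mixup_captions(caption_a: str, caption_b: str) -> str:
    """Ask ChatGPT to merge two audio captions into one plausible caption
    describing both sound scenes (hypothetical prompt wording)."""
    prompt = (
        "Combine the two audio captions below into one grammatical caption "
        "that describes both sound scenes occurring together.\n"
        f"Caption 1: {caption_a}\n"
        f"Caption 2: {caption_b}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

In a mix-up setup, the corresponding audio clips would be mixed as well, yielding new audio-caption training pairs from the fused captions.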

Publication date: 4 Oct 2023
Project Page: Not Specified
Paper: https://arxiv.org/pdf/2309.17352