The article introduces Fast Language-Audio Pre-training (FLAP), a self-supervised learning approach that aligns audio and language representations through masking, contrastive learning, and reconstruction. FLAP randomly drops audio spectrogram tokens and performs self-supervision on the remaining ones, which speeds up training. It aligns paired audio and text representations in a shared latent space via inter-modal contrastive learning. FLAP also leverages large language models to augment the text inputs, which further improves performance. It achieves state-of-the-art results on audio-text retrieval benchmarks, AudioCaps and Clotho.
Publication date: 8 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.01615
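The inter-modal contrastive objective described above can be sketched as a CLIP-style symmetric InfoNCE loss over a batch of paired audio/text embeddings. This is a minimal NumPy sketch of that standard formulation, not FLAP's exact implementation; the function name and temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: (batch, dim) arrays; row i of each forms a pair.
    """
    # L2-normalize so dot products become cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = a @ t.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))  # matching pairs lie on the diagonal

    def cross_entropy(lg):
        # Numerically stable log-softmax over rows, then pick the
        # correct-pair (diagonal) entries.
        m = lg.max(axis=1, keepdims=True)
        log_probs = lg - m - np.log(np.exp(lg - m).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs yield a near-zero loss, while mismatched pairs are penalized, pulling paired audio and text embeddings together in the shared latent space.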