The article introduces Fast Language-Audio Pre-training (FLAP), a self-supervised approach that aligns audio and language representations through masking, contrastive learning, and reconstruction. FLAP randomly drops audio spectrogram tokens and performs self-supervision only on the remaining ones, which makes pre-training more efficient; the dropped tokens are reconstructed as an auxiliary objective. Paired audio and text representations are aligned in a shared latent space through inter-modal contrastive learning. FLAP also leverages large language models to augment the text inputs, which further improves performance. It achieves state-of-the-art results on audio-text retrieval on the AudioCaps and Clotho benchmarks.
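To make the token-dropping and contrastive-alignment steps concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's implementation: the function names (`drop_tokens`, `contrastive_loss`), the keep ratio, the temperature, the mean-pooling step, and the random tensors standing in for real encoder outputs are all hypothetical.

```python
# A minimal sketch of FLAP's two core ideas: randomly dropping audio
# spectrogram tokens before further processing, and aligning the resulting
# audio embedding with a paired text embedding via a symmetric inter-modal
# contrastive (InfoNCE) loss. All shapes and hyperparameters are stand-ins.
import torch
import torch.nn.functional as F


def drop_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of tokens per example (FLAP-style masking).

    tokens: (batch, num_tokens, dim) spectrogram patch/frame embeddings.
    Returns (batch, num_kept, dim) with num_kept = int(num_tokens * keep_ratio).
    """
    b, n, d = tokens.shape
    num_kept = max(1, int(n * keep_ratio))
    # Random permutation per example; keep the first num_kept indices.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :num_kept]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, text) embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                       # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# Toy usage with random tensors in place of real encoder outputs.
batch, n_tokens, dim = 8, 64, 256
audio_tokens = torch.randn(batch, n_tokens, dim)   # spectrogram token embeddings
text_emb = torch.randn(batch, dim)                 # pooled text encoder output

kept = drop_tokens(audio_tokens, keep_ratio=0.5)   # self-supervise on the rest
audio_emb = kept.mean(dim=1)                       # simple pooling stand-in
loss = contrastive_loss(audio_emb, text_emb)
print(loss.item())
```

Dropping tokens before the bulk of the computation is what makes the method "fast": only the kept subset is processed, so compute shrinks roughly in proportion to the keep ratio.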

Publication date: 8 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.01615