FLAP: Fast Language-Audio Pre-training
The article introduces Fast Language-Audio Pre-training (FLAP), a self-supervised learning approach that learns to align audio and language representations through masking, contrastive learning, and reconstruction. FLAP randomly drops audio spectrogram…
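As a rough illustration of two of the ingredients named above, masking by random dropping and contrastive alignment, here is a minimal NumPy sketch. It is not FLAP's implementation. The function names (`random_drop`, `symmetric_contrastive_loss`), the array shapes, the 25% keep ratio, the 0.07 temperature, and the mean-pooling stand-in for an encoder are all assumptions chosen for the example. The reconstruction objective is left out.

```python
import numpy as np

def random_drop(tokens, keep_ratio, rng):
    """Keep a random subset of rows; return the kept rows and their sorted indices."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep_idx], keep_idx

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: row i of audio_emb is the positive for row i of text_emb."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature           # (batch, batch) cosine similarities
    diag = np.arange(logits.shape[0])
    loss_a2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)

# Hypothetical data: 4 clips, 16 patch tokens each, 8-dimensional features.
rng = np.random.default_rng(0)
batch, n_tokens, dim = 4, 16, 8
audio_tokens = rng.normal(size=(batch, n_tokens, dim))
text_emb = rng.normal(size=(batch, dim))

# Drop tokens per clip, then mean-pool the survivors (a stand-in for an audio encoder).
audio_emb = np.stack(
    [random_drop(clip, 0.25, rng)[0].mean(axis=0) for clip in audio_tokens]
)
loss = symmetric_contrastive_loss(audio_emb, text_emb)
```

In this sketch only the kept tokens would reach the encoder, so the per-clip cost shrinks with the keep ratio, while the contrastive term pulls each clip's embedding toward its paired text embedding and away from the other texts in the batch.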