This research investigates the application of the Contrastive Language-Image Pre-training (CLIP) model to sound source localization. The authors propose a framework that translates audio signals into tokens compatible with CLIP’s text encoder, producing audio-driven embeddings. These embeddings are then used to generate audio-grounded masks for the input audio, and audio-grounded image features are extracted from the highlighted regions. The findings suggest that leveraging pre-trained image-text models enables more complete localization maps for sounding objects. The proposed method significantly outperforms existing sound source localization approaches.
Publication date: 8 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.04066
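The sketch below is a minimal illustration of the pipeline described in the summary: audio features are projected into tokens for CLIP’s text-token space, pooled into an audio-driven embedding, compared against image-patch features to form an audio-grounded mask, and used to pool audio-grounded image features. It is an assumption-laden toy, not the authors’ implementation; all module names, dimensions, and the mean-pooling stand-in for CLIP’s frozen text encoder are hypothetical.

```python
# Illustrative sketch only; names, dimensions, and the pooling stand-in for
# CLIP's frozen text encoder are assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioTokenizer(nn.Module):
    """Maps a pretrained audio-encoder feature to tokens in CLIP's text-token space."""

    def __init__(self, audio_dim=512, embed_dim=512, num_audio_tokens=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, num_audio_tokens * embed_dim)
        self.num_audio_tokens = num_audio_tokens
        self.embed_dim = embed_dim

    def forward(self, audio_feat):                 # (B, audio_dim)
        tokens = self.proj(audio_feat)             # (B, num_tokens * embed_dim)
        return tokens.view(-1, self.num_audio_tokens, self.embed_dim)


def audio_grounded_mask(patch_feats, audio_emb, tau=0.07):
    """Cosine similarity between image-patch features and the audio-driven
    embedding, normalized into a soft localization mask over patches."""
    patch = F.normalize(patch_feats, dim=-1)       # (B, N_patches, D)
    audio = F.normalize(audio_emb, dim=-1)         # (B, D)
    sim = torch.einsum("bnd,bd->bn", patch, audio) / tau
    return sim.softmax(dim=-1)                     # (B, N_patches)


# Toy usage with random tensors standing in for frozen CLIP features.
B, N_patches, D = 2, 196, 512                      # e.g. a 14x14 ViT patch grid
audio_feat = torch.randn(B, 512)                   # pretrained audio-encoder output
patch_feats = torch.randn(B, N_patches, D)         # CLIP image-patch features

tokenizer = AudioTokenizer()
audio_tokens = tokenizer(audio_feat)               # tokens fed to CLIP's text encoder

# Stand-in for the frozen CLIP text encoder: pool the audio tokens into a
# single audio-driven embedding (the real model would run a transformer here).
audio_emb = audio_tokens.mean(dim=1)               # (B, D)

mask = audio_grounded_mask(patch_feats, audio_emb)                 # (B, N_patches)
# Audio-grounded image features: mask-weighted pooling over highlighted patches.
audio_grounded_img = torch.einsum("bn,bnd->bd", mask, patch_feats)  # (B, D)
print(mask.shape, audio_grounded_img.shape)
```

In this toy version the soft mask doubles as pooling weights, so regions most similar to the audio-driven embedding dominate the audio-grounded image feature; the actual framework should be consulted in the linked paper.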