This work investigates the use of the Contrastive Language-Image Pretraining (CLIP) model for sound source localization. The authors propose a framework that translates audio signals into tokens compatible with CLIP's text encoder, producing audio-driven embeddings. These embeddings are then used to generate audio-grounded masks for the given audio, and audio-grounded image features are extracted from the highlighted regions. The findings suggest that leveraging a pre-trained image-text model yields more complete localization maps for sounding objects, and the proposed method significantly outperforms existing sound source localization approaches.
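
To make the pipeline concrete, here is a minimal, hypothetical sketch of the flow described above: audio features are projected into pseudo-tokens in CLIP's text-token space, the resulting audio-driven embedding is compared against image patch features to form a soft audio-grounded mask, and that mask pools the patches into an audio-grounded image feature. Module names, dimensions, and the mean-pooling stand-in for CLIP's text encoder are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the audio-to-CLIP pipeline; names and dims are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioTokenizer(nn.Module):
    """Maps an audio embedding to a short sequence of pseudo text tokens
    assumed to live in CLIP's text-token embedding space (dim 512 here)."""
    def __init__(self, audio_dim=768, clip_token_dim=512, n_tokens=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, clip_token_dim * n_tokens)
        self.n_tokens = n_tokens
        self.clip_token_dim = clip_token_dim

    def forward(self, audio_feat):                       # (B, audio_dim)
        tokens = self.proj(audio_feat)                    # (B, n_tokens * dim)
        return tokens.view(-1, self.n_tokens, self.clip_token_dim)

def audio_grounded_mask(audio_emb, patch_feats, tau=0.07):
    """Cosine similarity between the audio-driven embedding and each image
    patch feature, normalized into a soft localization mask."""
    audio_emb = F.normalize(audio_emb, dim=-1)            # (B, D)
    patch_feats = F.normalize(patch_feats, dim=-1)        # (B, N, D)
    sim = torch.einsum('bd,bnd->bn', audio_emb, patch_feats) / tau
    return sim.softmax(dim=-1)                            # (B, N), sums to 1

# Toy usage with random tensors standing in for frozen encoders.
B, N, D = 2, 196, 512
audio_feat = torch.randn(B, 768)                          # e.g. from an audio encoder
patch_feats = torch.randn(B, N, D)                        # e.g. CLIP ViT patch features
pseudo_tokens = AudioTokenizer()(audio_feat)              # would be fed to CLIP's text encoder
audio_emb = pseudo_tokens.mean(dim=1)                     # placeholder for the text-encoder output
mask = audio_grounded_mask(audio_emb, patch_feats)        # (B, N) soft audio-grounded mask
masked_feat = torch.einsum('bn,bnd->bd', mask, patch_feats)  # audio-grounded image feature
print(mask.shape, masked_feat.shape)
```

In practice the mask would be reshaped to the ViT patch grid (e.g. 14x14 for 196 patches) and upsampled to produce the localization map.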

Publication date: 8 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.04066