The paper presents LAVSS, a location-guided audio-visual spatial audio separator. Existing monaural audio-visual separation (MAVS) methods often overlook the location of the sound source, a cue that is crucial in VR/AR scenarios. LAVSS addresses this by incorporating spatial cues and positional representations of sounding objects, which helps distinguish similar-sounding sources located in different directions. It also applies multi-level cross-modal attention so that visual and positional features collaborate with audio features, and it leverages a pre-trained monaural separator to boost spatial audio separation. Experiments on the FAIR-Play dataset show that LAVSS outperforms existing baselines.
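The cross-modal attention described above can be pictured as audio features (queries) attending over visual and positional tokens (keys/values). Below is a minimal single-level numpy sketch; the function name, random projection weights, residual fusion, and tensor shapes are all illustrative assumptions, not the authors' exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, visual_feats, pos_feats, rng=None):
    """Audio queries attend over visual and positional keys/values.

    audio_feats:  (T, d) audio time-frequency features
    visual_feats: (N, d) visual object/patch features
    pos_feats:    (N, d) positional representations of sounding objects
    """
    rng = rng or np.random.default_rng(0)
    d = audio_feats.shape[1]
    # Hypothetical projection weights; in the real model these are learned.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Concatenate visual and positional tokens into one context sequence.
    context = np.concatenate([visual_feats, pos_feats], axis=0)   # (2N, d)
    Q = audio_feats @ Wq
    K = context @ Wk
    V = context @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))                          # (T, 2N)
    # Residual fusion of attended visual-positional context into audio.
    return audio_feats + attn @ V                                 # (T, d)
```

In the paper this fusion is applied at multiple feature levels; stacking this block at several decoder scales would approximate that "multi-level" design.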


Publication date: 31 Oct 2023
Project Page: https://yyx666660.github.io/LAVSS/
Paper: https://arxiv.org/pdf/2310.20446