The authors present an enhanced audio-visual Sound Event Localization and Detection (SELD) network that improves on the audio-only SELDnet23 model by fusing audio and video information. The system leverages the YOLO and Detic object detectors within a framework that also performs audio-visual data augmentation and synthetic data generation, and it outperforms the existing audio-visual SELD baseline. The authors additionally introduce novel video and audio processing techniques for model training and release their work as an open-source framework.
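One way such detector outputs can feed a SELD model is by mapping a bounding-box center in a 360° (equirectangular) video frame to a direction-of-arrival estimate that can be aligned with the audio branch. The helper below is a minimal sketch of that geometric mapping only; the function name and the assumption of equirectangular frames are illustrative and not taken from the paper, whose actual fusion pipeline may differ.

```python
import math

def bbox_center_to_doa(cx: float, cy: float, frame_w: int, frame_h: int):
    """Map a bounding-box center (in pixels) on an equirectangular 360-degree
    frame to (azimuth, elevation) in degrees.

    Hypothetical helper for illustration: assumes the frame spans
    azimuth [-180, 180) left-to-right and elevation [90, -90] top-to-bottom.
    """
    azimuth = (cx / frame_w) * 360.0 - 180.0   # horizontal position -> azimuth
    elevation = 90.0 - (cy / frame_h) * 180.0  # vertical position -> elevation
    return azimuth, elevation

# A detection centered in a 1920x1080 frame points straight ahead:
az, el = bbox_center_to_doa(960, 540, 1920, 1080)
print(az, el)  # 0.0 0.0
```

A direction obtained this way can then be compared or combined with the spatial cues the audio branch extracts from the multichannel input.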
Publication date: 31 Jan 2024
Project Page: https://github.com/aromanusc/SoundQ
Paper: https://arxiv.org/pdf/2401.17129