The article presents a new approach to Active Speaker Detection (ASD), the task of determining whether a visible person is speaking in a series of video frames. The authors propose TalkNCE, a talk-aware contrastive loss that encourages the model to learn effective representations from the natural correspondence between speech and facial movements. The loss is optimized jointly with the existing objectives for training ASD models and requires no additional supervision or training data, so it can be easily integrated into existing ASD frameworks to improve their performance. The method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
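To make the idea concrete, here is a minimal sketch of an InfoNCE-style audio-visual contrastive loss over time-aligned frame embeddings: the audio and visual embeddings of the same time step form a positive pair, while all other frames in the clip serve as negatives. This is only an illustration of the general technique, not the authors' exact TalkNCE implementation; the function name, shapes, and temperature value are assumptions.

```python
import numpy as np

def talk_aware_nce_loss(audio_emb, visual_emb, temperature=0.07):
    """InfoNCE-style loss for time-aligned audio/visual embeddings.

    audio_emb, visual_emb: arrays of shape (T, D), one embedding per
    video frame. Matching time indices are positives; all other frames
    act as negatives. Hypothetical sketch, not the paper's exact loss.
    """
    # L2-normalize so dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    # (T, T) similarity matrix, scaled by the temperature
    logits = (a @ v.T) / temperature

    # Log-softmax over each row, with stabilization against overflow
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the diagonal (aligned frames) as targets
    return -np.mean(np.diag(log_probs))
```

A loss of this shape can be added to the usual ASD classification objective as a weighted extra term (e.g. `L_total = L_asd + lam * L_nce`), which matches the paper's description of joint optimization without extra supervision.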
Publication date: 21 Sep 2023
arXiv page: https://arxiv.org/abs/2309.12306
Paper: https://arxiv.org/pdf/2309.12306