The article presents a novel approach to speaker verification, a key technology for person authentication. It focuses on audio-visual fusion, leveraging both faces and voices to obtain richer speaker information. The study argues that existing methods have not fully exploited this potential. The authors introduce a cross-modal joint attention mechanism designed to make full use of both inter-modal complementary information and intra-modal information. The method estimates cross-attention weights from the correlation between a joint feature representation and the individual audio and visual feature representations. Experiments show that the proposed approach significantly improves audio-visual fusion for speaker verification, outperforming state-of-the-art methods on the VoxCeleb1 dataset. A sketch of this fusion idea is given below.
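
The following is a minimal PyTorch sketch of the fusion idea described above: a joint representation is formed from the concatenated audio and visual features, per-modality attention weights are derived from the correlation between that joint representation and each individual modality, and the attended streams are fused. The module name, layer choices, and dimensions are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointCrossModalAttention(nn.Module):
    """Sketch of joint cross-attention fusion for audio-visual speaker verification.

    Attention weights for each modality come from the correlation between a joint
    (concatenated) representation and the individual audio / visual representations.
    All names and sizes here are assumptions for illustration.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        # Project concatenated audio-visual features into a joint representation.
        self.joint_proj = nn.Linear(2 * dim, dim)
        # Per-modality projections used when correlating with the joint representation.
        self.audio_proj = nn.Linear(dim, dim)
        self.video_proj = nn.Linear(dim, dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (batch, time, dim) frame-level embeddings.
        joint = self.joint_proj(torch.cat([audio, video], dim=-1))  # (B, T, D)

        # Correlate the joint representation with each modality to get attention scores.
        audio_scores = torch.matmul(self.audio_proj(audio), joint.transpose(1, 2))  # (B, T, T)
        video_scores = torch.matmul(self.video_proj(video), joint.transpose(1, 2))  # (B, T, T)

        scale = audio.size(-1) ** 0.5
        audio_attn = F.softmax(audio_scores / scale, dim=-1)
        video_attn = F.softmax(video_scores / scale, dim=-1)

        # Re-weight each modality with attention conditioned on the joint representation,
        # then fuse the attended streams and pool over time.
        audio_att = torch.matmul(audio_attn, audio)
        video_att = torch.matmul(video_attn, video)
        fused = (audio_att + video_att) / 2
        return fused.mean(dim=1)  # utterance-level embedding (B, D)


# Example usage with random frame-level features (shapes are assumptions):
fusion = JointCrossModalAttention(dim=512)
embedding = fusion(torch.randn(4, 50, 512), torch.randn(4, 50, 512))  # (4, 512)
```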
Publication date: 28 Sep 2023
Project Page: https://arxiv.org/abs/2309.16569v1
Paper: https://arxiv.org/pdf/2309.16569