The article presents AV-CPL, a semi-supervised method for audio-visual speech recognition that trains on both labeled and unlabeled videos, using pseudo-labels that are continuously regenerated during training. The recognition model is trained on audio-visual inputs and can then perform speech recognition from audio alone, video alone, or both modalities together. The authors report significant improvements in visual speech recognition (VSR) performance while maintaining practical automatic speech recognition (ASR) and audio-visual speech recognition (AVSR) performance. The VSR gains come from leveraging unlabeled visual speech.
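
The idea of continuously regenerated pseudo-labels can be illustrated with a toy sketch. This is not the authors' implementation: the nearest-centroid "model", the 1-D features, and the names `fit_centroids`, `predict`, and `continuous_pseudo_labeling` are all assumptions made for illustration, standing in for a neural recognizer and real audio-visual features. The sketch only shows the loop structure: warm up on labeled data, then repeatedly relabel the unlabeled data with the current model and retrain on the union.

```python
from statistics import mean


def fit_centroids(samples):
    """Train the toy 'model': one centroid per label (hypothetical stand-in)."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(centroids, x):
    """Return the label whose centroid is nearest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def continuous_pseudo_labeling(labeled, unlabeled, rounds=3):
    # Supervised warm-up on the labeled data only.
    model = fit_centroids(labeled)
    for _ in range(rounds):
        # Regenerate pseudo-labels with the current model state.
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        # Retrain on labeled plus freshly pseudo-labeled data.
        model = fit_centroids(labeled + pseudo)
    return model


# Toy data: two labeled points and six unlabeled ones.
labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
model = continuous_pseudo_labeling(labeled, unlabeled)
```

In this toy setting the unlabeled points pull each centroid toward the bulk of its class, which loosely mirrors how unlabeled video could refine a recognizer beyond what the labeled set alone supports.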

Publication date: 29 Sep 2023
arXiv: 2309.17395v1
Paper: https://arxiv.org/pdf/2309.17395