The paper presents a new approach to the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. The approach applies different extraction strategies depending on audio quality, aiming to balance interference removal against speech preservation. The paper reports character error rates of 24.2% and 33.2% on the Dev and Eval sets, respectively, which earned second place in the challenge. The study contributes to automatic speech recognition and target speaker extraction by using audio quality to choose among extraction strategies.
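For readers unfamiliar with the reported metric, the following is a minimal sketch of how a character error rate is conventionally computed: the character-level edit distance between a reference transcript and a recognizer hypothesis, divided by the reference length. It is not the challenge's official scoring script, and the function names `edit_distance` and `character_error_rate` are illustrative assumptions rather than anything taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # character of ref missing from hyp
            insertion = curr[j - 1] + 1  # extra character in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of 24.2% means roughly one character edit is needed for every four reference characters. Real scoring pipelines usually normalize the text first (for example, by removing punctuation), which this sketch leaves out.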
Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.03697