The paper proposes a method to improve the performance of a personalized voice activity detection (VAD) model under adverse acoustic conditions using self-supervised pretraining on a large unlabelled dataset. A long short-term memory (LSTM) encoder is pretrained within the autoregressive predictive coding (APC) framework and then fine-tuned for personalized VAD. The paper also introduces a denoising variant of APC to further improve robustness. The results show that self-supervised pretraining not only improves performance in clean conditions but also makes the models more resilient to adverse conditions than purely supervised learning.
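To make the APC objective concrete, below is a minimal NumPy sketch of the training loss it is built on: predict the feature frame `shift` steps ahead from the current frame and minimize the L1 error, with the denoising variant using a noisy input but a clean target. The linear predictor `W`, the feature dimensions, and the `shift` value are illustrative stand-ins (the paper uses an LSTM encoder and real acoustic features), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "log-mel" feature sequence: T frames x D dims (stand-in for real features).
T, D, shift = 100, 40, 3  # shift = how many frames ahead to predict (illustrative)
clean = rng.standard_normal((T, D))
noisy = clean + 0.1 * rng.standard_normal((T, D))  # denoising APC corrupts the input

# Linear stand-in for the LSTM encoder used in the paper.
W = 0.01 * rng.standard_normal((D, D))

def apc_l1_loss(inputs, targets, W, shift):
    """L1 loss between predicted and true frames `shift` steps ahead."""
    pred = inputs[:-shift] @ W            # predict frame t+shift from frame t
    return np.abs(pred - targets[shift:]).mean()

loss_plain = apc_l1_loss(clean, clean, W, shift)    # standard APC: clean in, clean target
loss_denoise = apc_l1_loss(noisy, clean, W, shift)  # denoising APC: noisy in, clean target
```

After pretraining with this objective, the encoder's hidden representations would be fine-tuned on the labelled personalized-VAD task.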
Publication date: 29 Dec 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2312.16613