This paper examines whether uncertainty measures derived from self-supervised learning (SSL) models such as wav2vec can predict audio quality in voice synthesis and voice conversion systems. Traditional subjective ratings such as Mean Opinion Scores (MOS) are difficult to collect at scale, which motivates an efficient automatic prediction method. The authors hypothesize that a model's uncertainty about the contents of an audio sequence corresponds to low audio quality. Their findings indicate that uncertainty measures can serve as effective proxies for audio quality, particularly in low-resource settings. The study uses data from the 2022 and 2023 VoiceMOS challenges.
Publication date: 29 Dec 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2312.15616
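
The idea that model uncertainty can stand in for audio quality can be illustrated with a minimal sketch. The summary above does not say which uncertainty measures the authors use, so this example assumes one common choice: the mean Shannon entropy of per-frame softmax distributions. The function name `mean_frame_entropy` and the logits are hypothetical, and no wav2vec model is loaded; in practice the logits would come from an SSL model's per-frame predictions.

```python
import numpy as np


def mean_frame_entropy(logits: np.ndarray) -> float:
    """Average Shannon entropy (in nats) of per-frame softmax distributions.

    logits: array of shape (num_frames, num_classes).
    Assumption: higher average entropy is read as higher model uncertainty.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax along the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    # Entropy per frame, then the mean over all frames.
    frame_entropy = -(probs * log_probs).sum(axis=-1)
    return float(frame_entropy.mean())


# Hypothetical logits for two utterances: 4 frames, 5 classes each.
confident = np.full((4, 5), -10.0)
confident[:, 0] = 10.0         # nearly all probability mass on one class
uncertain = np.zeros((4, 5))   # uniform distribution over the classes

print("confident:", mean_frame_entropy(confident))
print("uncertain:", mean_frame_entropy(uncertain))
```

Under the hypothesis described above, the utterance with the higher score would be predicted to have lower quality. Whether a score like this tracks MOS would need to be checked against rated data such as the VoiceMOS sets.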