This study presents MOSA-Net+, an improved multi-objective speech assessment model that uses acoustic features from Whisper, a weakly supervised model, to create embedding features. The research examines how strongly the embedding features of Whisper and of two self-supervised learning (SSL) models correlate with subjective quality and intelligibility scores. It also evaluates Whisper’s effectiveness in deploying a more robust speech assessment model. The results suggest that Whisper’s features correlate more strongly with subjective quality and intelligibility than the SSL models’ features, leading to more accurate predictions by MOSA-Net+. Additionally, combining Whisper and SSL models yields only marginal improvements.
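
To make the correlation analysis concrete, the following is a minimal sketch of how agreement between predicted and subjective scores is commonly measured, using the Pearson linear correlation and the Spearman rank correlation. The summary above does not state which correlation measures the study uses, so both choices are assumptions, and the function names and the toy scores are hypothetical illustrations rather than data or code from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only (not results from the paper).
subjective_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_scores = [1.2, 1.9, 3.5, 3.4, 4.8]

print("Pearson: ", round(pearson(subjective_scores, predicted_scores), 3))
print("Spearman:", round(spearman(subjective_scores, predicted_scores), 3))
```

A higher value on either measure indicates that a model's predictions track listener judgments more closely. The rank-based measure depends only on how utterances are ordered, so it is unaffected by any monotonic rescaling of the predicted scores.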

Publication date: 25 Sep 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2309.12766