Modern speech synthesis systems have improved to the point where synthetic speech is often hard to distinguish from real speech, yet evaluating it remains a challenge. Human evaluation with the Mean Opinion Score (MOS) is the gold standard but is slow and costly, so researchers rely on automatic metrics such as Word Error Rate (WER) to measure intelligibility. This paper proposes a new evaluation technique: an ASR model is trained on synthetic speech and its performance is then measured on real speech. The resulting WER on real speech is argued to reflect how closely the distribution of the synthetic speech matches that of real speech, providing a broader assessment of synthetic speech quality than intelligibility alone.
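As a rough illustration of this protocol (not the paper's implementation), the sketch below scores an ASR model that has already been trained on synthetic speech against a real-speech test set. The `asr_model.transcribe` call and the `(audio_path, reference_transcript)` test-set format are hypothetical placeholders introduced here for illustration; the WER computation itself is a standard word-level edit distance.

```python
# Minimal sketch of the evaluation protocol: an ASR model trained on synthetic
# speech is scored by its WER on real recordings. Lower WER suggests the
# synthetic training distribution is closer to real speech.

from typing import Iterable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,          # deletion
                d[i][j - 1] + 1,          # insertion
                d[i - 1][j - 1] + cost,   # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def evaluate_on_real_speech(asr_model, test_set: Iterable[Tuple[str, str]]) -> float:
    """Average per-utterance WER of the synthetic-trained ASR model on real speech."""
    scores: List[float] = []
    for audio_path, reference in test_set:
        hypothesis = asr_model.transcribe(audio_path)  # hypothetical API
        scores.append(word_error_rate(reference, hypothesis))
    return sum(scores) / max(len(scores), 1)
```

Note that this sketch averages per-utterance WER for simplicity; corpus-level WER (total word errors divided by total reference words) is the more common convention and could be substituted without changing the overall idea.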


Publication date: 4 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.00706