This article discusses the development of a noise-robust zero-shot text-to-speech (TTS) method. Existing zero-shot TTS that conditions on speaker embeddings extracted from reference speech with self-supervised learning (SSL) speech representations can reproduce speaker characteristics accurately, but its synthesis quality degrades when the reference speech contains noise. To address this, the authors propose incorporating adapters into the SSL model and fine-tuning it jointly with the TTS model using noisy reference speech. They also adopt a speech enhancement (SE) front-end to improve performance. With these changes, the proposed SSL-based zero-shot TTS achieves high-quality speech synthesis even when the reference speech is noisy.
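
To make the adapter idea concrete, here is a minimal sketch of a generic bottleneck adapter with a residual connection, written in plain Python. It is an illustration only: the summary does not specify the adapter architecture used in the paper, so the class name `BottleneckAdapter`, the ReLU nonlinearity, the layer sizes, and the zero initialization of the up-projection are all assumptions. The point it shows is that such a module starts out as an identity mapping, so the SSL features are unchanged before fine-tuning on noisy reference speech begins.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project, add residual.

    This is a generic adapter design, not necessarily the one used in the paper.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck x dim), small random initialization.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck), zero-initialized so the adapter
        # is an identity mapping before any fine-tuning (an assumption).
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, features):
        hidden = [max(0.0, h) for h in matvec(self.down, features)]  # ReLU
        delta = matvec(self.up, hidden)
        # Residual connection: original SSL feature plus the learned correction.
        return [x + d for x, d in zip(features, delta)]


# One frame of (made-up) SSL features passes through unchanged at initialization.
adapter = BottleneckAdapter(dim=4, bottleneck=2)
frame = [0.5, -1.0, 2.0, 0.0]
print(adapter(frame) == frame)
```

In a setup like the one described, only the small adapter weights (and the TTS model) would need to adapt to noisy inputs, while the residual path preserves the original SSL representation.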

Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.05111