This paper introduces a zero-shot text-to-speech (TTS) model that can replicate the voice of an unseen speaker without requiring any adaptation parameters. The model uses multi-scale acoustic prompts to capture the target speaker's unique speaking style. The speech waveform is quantized into a sequence of discrete acoustic tokens, which are then modeled with a language model. The model adapts to a new speaker from an acoustic prompt as short as 3 seconds, and it outperforms baseline models in both naturalness and speaker similarity.
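
To make the pipeline concrete, below is a minimal PyTorch sketch of the two-stage idea the summary describes: a codec quantizes the waveform into discrete acoustic tokens, and a decoder-only language model predicts the target utterance's tokens conditioned on phoneme tokens and the tokens of a short acoustic prompt. This is an illustrative sketch, not the authors' implementation: `ToyCodec` is a crude stand-in for a real neural audio codec, and every class, dimension, and parameter name here is hypothetical.

```python
import torch
import torch.nn as nn

class ToyCodec(nn.Module):
    """Crude stand-in for a neural audio codec: turns a waveform into a
    sequence of discrete acoustic tokens by frame-averaging and uniform
    quantization. A real system would use a learned codec instead."""
    def __init__(self, codebook_size=1024, hop=320):
        super().__init__()
        self.codebook_size, self.hop = codebook_size, hop

    def encode(self, wav):  # wav: (batch, n_samples), values in [-1, 1]
        frames = wav.unfold(1, self.hop, self.hop).mean(-1)   # (batch, n_frames)
        ids = ((frames + 1) / 2 * (self.codebook_size - 1)).round().long()
        return ids.clamp(0, self.codebook_size - 1)           # discrete tokens

class AcousticLM(nn.Module):
    """Decoder-only Transformer over acoustic tokens, conditioned on phoneme
    tokens and the acoustic tokens of a short speaker prompt."""
    def __init__(self, n_phones=100, n_codes=1024, d=256, n_layers=4):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d)
        self.code_emb = nn.Embedding(n_codes, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, n_codes)

    def forward(self, phones, prompt_codes, target_codes):
        # Concatenate [text; prompt audio tokens; target audio tokens] and
        # apply a causal mask so each position attends only to its prefix.
        x = torch.cat([self.phone_emb(phones),
                       self.code_emb(prompt_codes),
                       self.code_emb(target_codes)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, mask=mask)
        return self.head(h[:, -target_codes.size(1):])  # next-token logits

codec, lm = ToyCodec(), AcousticLM()
prompt_wav = torch.rand(1, 3 * 16000) * 2 - 1   # 3-second prompt at 16 kHz
prompt_codes = codec.encode(prompt_wav)         # (1, 150) acoustic tokens
phones = torch.randint(0, 100, (1, 20))         # phonemized target text
target = torch.randint(0, 1024, (1, 50))        # teacher-forced target tokens
logits = lm(phones, prompt_codes, target)       # (1, 50, 1024)
```

In practice the tokens would come from a trained neural codec and the language model would be far larger; the point is only the interface: prompt audio becomes tokens, and the model autoregressively continues those tokens conditioned on the text.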

Publication date: 25 Sep 2023
Project Page: https://thuhcsi.github.io/icassp2024-msvalle
Paper: https://arxiv.org/pdf/2309.11977