The article introduces the PHEME model series for efficient, conversational speech generation. Unlike existing models, which require large neural components and extensive training, PHEME models are compact yet high-performing and can be trained on smaller-scale conversational data. This reduces data demands by more than 10x while still delivering quality comparable to state-of-the-art models. The PHEME series also supports parallel speech generation and produces natural conversational speech. In addition, teacher-student distillation can further improve voice quality in single-speaker setups.
Publication date: 5 Jan 2024
Project Page: https://arxiv.org/abs/2401.02839v1
Paper: https://arxiv.org/pdf/2401.02839