The article presents FreeStyleTTS, a model for expressive text-to-speech (TTS) synthesis that requires minimal human annotation. The approach uses a large language model (LLM) to recast expressive TTS as a style retrieval task: given an external style prompt, it selects the best-matching style references, which then guide the TTS pipeline to synthesize speech in the intended style. The article shows that the model can retrieve the desired style from either the input text or a user-defined description, producing synthetic speech that closely matches the specified style.
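
To make the idea of "style retrieval" concrete, here is a minimal sketch of the selection step only. It is not the authors' implementation: a toy bag-of-words cosine similarity stands in for the large language model, and the reference pool, its ids, and the function names (`retrieve_style`, `REFERENCES`) are hypothetical, introduced purely for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for an LLM's reading of style."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_style(prompt: str, references: Dict[str, str]) -> str:
    """Return the id of the reference whose description best matches the prompt."""
    query = bag_of_words(prompt)
    return max(references, key=lambda ref_id: cosine(query, bag_of_words(references[ref_id])))


# Hypothetical pool: reference id -> short description of that utterance's style.
REFERENCES = {
    "ref_happy": "cheerful excited happy voice",
    "ref_sad": "sad slow sorrowful voice",
    "ref_angry": "angry loud harsh voice",
}

# The prompt may be the raw input text or a user-written style description.
print(retrieve_style("a sad and slow farewell", REFERENCES))  # ref_sad
```

In a full pipeline, the selected reference would then be passed to the TTS model as the style condition; that stage is omitted here because the summary does not describe it in detail.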

 

Publication date: 3 Nov 2023
Paper: https://arxiv.org/pdf/2311.01260