The article presents FreeStyleTTS, a model for expressive text-to-speech (TTS) synthesis that requires minimal human annotation. The model uses a large language model (LLM) to recast expressive TTS as a style retrieval task: given an external style prompt, which can be either the raw input text or a natural language style description, the LLM selects the best-matching style references. This gives flexible and precise style control with little manual effort. Experiments on a Mandarin storytelling corpus show that the model retrieves the desired styles from either input text or user-defined descriptions, producing synthesized speech that closely matches the specified styles.
Publication date: 3 Nov 2023
Project Page: [email protected]
Paper: https://arxiv.org/pdf/2311.01260
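
The style retrieval idea summarized above can be illustrated with a toy sketch. This is not the paper's method: FreeStyleTTS relies on an LLM to choose style references, whereas the snippet below swaps in a simple bag-of-words cosine similarity so that it runs with the Python standard library alone. The reference bank, its annotations, and the names `STYLE_BANK` and `retrieve_style` are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter

# Hypothetical style-reference bank: reference id -> short style annotation.
# These entries are made up for illustration and do not come from the paper.
STYLE_BANK = {
    "ref_calm": "calm slow gentle narration",
    "ref_angry": "angry loud fast shouting",
    "ref_sad": "sad quiet slow trembling voice",
}


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_style(prompt, bank=STYLE_BANK):
    """Return the id of the reference whose annotation best matches the prompt.

    Assumption: lexical overlap stands in for the LLM-based matching
    described in the summary above.
    """
    query = bag_of_words(prompt)
    return max(bank, key=lambda ref_id: cosine(query, bag_of_words(bank[ref_id])))


if __name__ == "__main__":
    # The style prompt may be a style description or the raw input text.
    print(retrieve_style("an angry shouting villain"))
    print(retrieve_style("a gentle calm grandmother"))
```

In a real system the selected reference would then condition the TTS model. Lexical overlap is far weaker than an LLM at this task, since it cannot infer a style that the prompt only implies.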