The article discusses FreeStyleTTS, a model for expressive text-to-speech (TTS) synthesis that requires minimal human annotation. The model uses a large language model to recast expressive TTS as a style retrieval task: it selects the best-matching style references based on external style prompts, which can be either the raw input text or natural-language style descriptions. This approach provides flexible and precise style control with minimal human workload. Experiments on a Mandarin storytelling corpus show that the model retrieves the desired styles from either the input text or user-defined descriptions, producing synthetic speech closely aligned with the specified styles.
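The paper's retrieval mechanism is LLM-driven; as a rough illustration of the underlying idea, the sketch below frames style selection as a nearest-neighbor search over text embeddings. The `embed` function, the reference list, and the cosine-similarity criterion are illustrative assumptions for this sketch, not the authors' implementation.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Placeholder text encoder; a real system would use an LLM or
    sentence-embedding model here. This toy version exists only to
    make the sketch self-contained and runnable."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)  # unit norm, so dot product = cosine

def retrieve_style(prompt: str, style_references: list[str]) -> str:
    """Pick the style reference whose embedding has the highest cosine
    similarity to the prompt (raw input text or a style description)."""
    q = embed(prompt)
    sims = [float(q @ embed(ref)) for ref in style_references]
    return style_references[int(np.argmax(sims))]

# The prompt can be the raw input text or a natural-language style
# description; the references stand in for an annotated style corpus.
references = ["calm narration", "excited storytelling", "whispered suspense"]
print(retrieve_style("read this in a tense, hushed voice", references))
```

Casting style control as retrieval means the synthesis model never needs dense per-utterance style labels; only the reference corpus has to carry style information, which is what keeps the human annotation workload small.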

Publication date: 3 Nov 2023
Paper: https://arxiv.org/pdf/2311.01260