The article presents FreeStyleTTS, a model for expressive text-to-speech (TTS) synthesis that requires minimal human annotation. The approach leverages a large language model (LLM) to recast expressive TTS as a style retrieval task: the LLM selects the best-matching style references according to external style prompts, which then guide the TTS pipeline to synthesize speech in the intended style. The article demonstrates the model’s proficiency in retrieving desired styles from either input text or user-defined descriptions, producing synthesized speech closely aligned with the specified styles.
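The core retrieval idea can be illustrated with a minimal sketch: given an embedding of an external style prompt, pick the reference whose style embedding is most similar. The embeddings, style names, and the `retrieve_style` helper below are illustrative assumptions, not the paper's actual pipeline (which uses an LLM to perform the matching).

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_style(prompt_emb, style_bank):
    # Return the name of the reference style whose embedding
    # best matches the prompt embedding.
    return max(style_bank, key=lambda name: cosine(prompt_emb, style_bank[name]))

# Toy style bank: style name -> (hypothetical) style embedding.
style_bank = {
    "cheerful": [0.9, 0.1, 0.0],
    "sad":      [0.1, 0.9, 0.0],
    "angry":    [0.0, 0.2, 0.9],
}

print(retrieve_style([0.8, 0.2, 0.1], style_bank))  # → cheerful
```

The retrieved reference would then condition the TTS model, steering synthesis toward the selected style.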
Publication date: 3 Nov 2023
Contact: op.131@sjtu.edu.cn
Paper: https://arxiv.org/pdf/2311.01260