The study presents Daisy-TTS, a text-to-speech system that simulates a broad spectrum of emotions. It uses a prosody encoder to learn an emotionally separable prosody embedding, which serves as a proxy for emotion. This allows the system to simulate primary and secondary emotions, intensity levels, and emotion polarity. In perceptual evaluations, the system achieved higher emotional speech naturalness and emotion perceivability than the baseline.
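
One way to picture how a single embedding space could cover secondary emotions, intensity levels, and polarity is plain vector arithmetic on emotion embeddings. The sketch below is purely illustrative and is not the paper's implementation: the 4-dimensional toy vectors, the emotion labels attached to them, and the helper names `mix`, `scale`, and `flip_polarity` are all assumptions made for this example, and the real model's embedding size and operations may differ.

```python
import numpy as np

# Toy stand-ins for learned prosody embeddings of two primary emotions.
# Real embeddings would come from a trained prosody encoder (assumed here).
joy = np.array([1.0, 0.0, 0.0, 0.0])
trust = np.array([0.0, 1.0, 0.0, 0.0])


def mix(a: np.ndarray, b: np.ndarray, weight: float = 0.5) -> np.ndarray:
    """Hypothetical helper: blend two primary-emotion embeddings."""
    return weight * a + (1.0 - weight) * b


def scale(embedding: np.ndarray, intensity: float) -> np.ndarray:
    """Hypothetical helper: raise or lower emotion intensity by scaling."""
    return intensity * embedding


def flip_polarity(embedding: np.ndarray) -> np.ndarray:
    """Hypothetical helper: move to the opposite side of the embedding space."""
    return -embedding


# A secondary emotion sketched as an equal blend of two primary emotions.
blend = mix(joy, trust)

# The same emotion at a stronger intensity level.
strong_joy = scale(joy, 1.5)

# An opposite-polarity embedding obtained by negation.
opposite_of_joy = flip_polarity(joy)
```

In this toy setup, each derived vector would be passed to the synthesizer in place of a primary-emotion embedding; how the actual system conditions speech generation on the embedding is not described in this summary.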

Publication date: 22 Feb 2024
Project Page: https://rendchevi.github.io/daisy-tts/
Paper: https://arxiv.org/pdf/2402.14523