The article proposes a Discourse-level Multi-scale text Prosodic Model (D-MPM) that predicts prosodic features for fine-grained emotion analysis. These predictions can guide speech synthesis toward more expressive output. The authors also introduce a new Discourse-level Chinese Audiobook (DCA) dataset, with over 13,000 annotated utterances, for evaluating the model. The model showed promising results both in predicting prosodic features and in improving user experience. Notably, speech synthesized with the model's guidance was rated better than the original speech on some user evaluation measures.

Publication date: 25 Sep 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2309.11849