This paper introduces DurIAN-E, an improved duration informed attention neural network for expressive and high-quality text-to-speech synthesis. DurIAN-E uses multiple stacked SwishRNN-based Transformer blocks as linguistic encoders and incorporates Style-Adaptive Instance Normalization (SAIN) layers to enhance expressiveness. A denoiser is also employed to further improve the quality and expressiveness of the synthesized speech. The model outperforms state-of-the-art approaches in subjective mean opinion score (MOS) and preference tests, demonstrating its effectiveness in synthesizing more natural-sounding speech.
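
To make the SAIN component more concrete, below is a minimal PyTorch sketch of a style-adaptive instance normalization layer: features are instance-normalized and then re-scaled and shifted using a gain and bias predicted from a style embedding. This is not the authors' implementation; the class name, tensor shapes, and the single linear affine predictor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleAdaptiveInstanceNorm(nn.Module):
    """Sketch of SAIN: instance-normalize hidden features, then apply a
    per-channel scale and shift predicted from a style embedding."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # Plain instance norm without learned affine parameters;
        # the affine transform comes from the style vector instead.
        self.norm = nn.InstanceNorm1d(hidden_dim, affine=False)
        # One linear layer predicts per-channel gamma and beta from the style embedding.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden_dim, time), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        x = self.norm(x)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)


if __name__ == "__main__":
    layer = StyleAdaptiveInstanceNorm(hidden_dim=256, style_dim=128)
    feats = torch.randn(2, 256, 100)   # e.g. encoder outputs over 100 frames
    style = torch.randn(2, 128)        # utterance-level style embedding
    print(layer(feats, style).shape)   # torch.Size([2, 256, 100])
```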

Publication date: 25 Sep 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2309.12792