The paper introduces E3 TTS (Easy End-to-End Diffusion-based Text to Speech), a diffusion-based text-to-speech model. Unlike previous models, E3 TTS does not rely on intermediate representations such as spectrogram features or alignment information. Instead, it models the temporal structure of the waveform directly through a diffusion process. This approach lets E3 TTS support flexible latent structure within the audio, making it adaptable to zero-shot tasks such as editing without additional training. Experimental results show that E3 TTS generates high-fidelity audio, approaching the performance of state-of-the-art neural text-to-speech systems.
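To make the core idea concrete, here is a minimal sketch of DDPM-style ancestral sampling over a raw waveform conditioned on text embeddings. This is an illustration of waveform diffusion in general, not the paper's implementation: `ToyDenoiser`, `sample_waveform`, the noise schedule, and all shapes are assumptions for the sketch. In E3 TTS the denoiser is a U-Net conditioned on features from a pretrained BERT text encoder.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for the paper's U-Net: predicts the noise
    in a noisy waveform. A real model would also condition on the
    timestep t and the text embeddings via cross-attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Conv1d(1, dim, 3, padding=1)
        self.out = nn.Conv1d(dim, 1, 3, padding=1)

    def forward(self, x_t, t, text_emb):
        # t and text_emb are ignored here purely to keep the toy runnable.
        h = torch.relu(self.proj(x_t))
        return self.out(h)

@torch.no_grad()
def sample_waveform(model, text_emb, steps=1000, length=16000):
    """Standard DDPM ancestral sampling over a fixed-length waveform:
    start from Gaussian noise and iteratively denoise, so the temporal
    structure of the audio emerges from the diffusion process itself."""
    betas = torch.linspace(1e-4, 0.02, steps)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, 1, length)               # pure noise, no alignment given
    for t in reversed(range(steps)):
        eps = model(x, torch.tensor([t]), text_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Usage: in the paper, text_emb would come from a frozen BERT encoder.
text_emb = torch.randn(1, 16, 512)              # placeholder text embeddings
audio = sample_waveform(ToyDenoiser(), text_emb)
```

Because the model denoises the full waveform jointly rather than expanding a spectrogram frame by frame, edits can be expressed as constraints on parts of the noisy waveform during sampling, which is what enables the zero-shot editing behavior described above.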

Publication date: 3 Nov 2023
Project Page: https://e3tts.github.io
Paper: https://arxiv.org/pdf/2311.00945