The article introduces E3 TTS, a novel diffusion-based end-to-end text-to-speech model. Unlike previous models, E3 TTS does not rely on intermediate representations such as spectrogram features or alignment information. Instead, it models the temporal structure of the waveform directly through a diffusion process. This approach allows E3 TTS to support flexible latent structure within the audio, making it adaptable to zero-shot tasks such as editing without additional training. Experimental results show that E3 TTS can generate high-fidelity audio, approaching the performance of a state-of-the-art neural text-to-speech system.
Publication date: 3 Nov 2023
Project Page: https://e3tts.github.io
Paper: https://arxiv.org/pdf/2311.00945
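
To make the "iterative refinement" idea concrete, below is a minimal, hypothetical sketch of DDPM-style ancestral sampling that maps text tokens straight to a waveform, with no spectrogram or alignment stage in between. All names here (`TextEncoder`, `WaveformDenoiser`, the noise schedule, the sample length) are illustrative assumptions for the sketch, not the E3 TTS implementation; the paper's actual architecture uses a pretrained BERT text encoder feeding a much larger U-Net denoiser.

```python
# Sketch of diffusion-based end-to-end TTS sampling (illustrative only).
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Hypothetical text encoder: token IDs -> conditioning embeddings."""

    def __init__(self, vocab_size: int = 256, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)  # (batch, tokens, dim)


class WaveformDenoiser(nn.Module):
    """Hypothetical denoiser: predicts the noise in a noisy waveform,
    conditioned on text (a tiny stand-in for the paper's U-Net)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(dim, 1)
        self.net = nn.Conv1d(2, 1, kernel_size=9, padding=4)

    def forward(self, noisy_wave, text_emb, t):
        # Broadcast a crude text summary over the fixed-length waveform
        # (timestep conditioning on t is omitted for brevity).
        cond = self.proj(text_emb.mean(dim=1, keepdim=True))  # (batch, 1, 1)
        cond = cond.expand(-1, 1, noisy_wave.shape[-1])
        x = torch.cat([noisy_wave, cond], dim=1)  # (batch, 2, samples)
        return self.net(x)


@torch.no_grad()
def sample_waveform(text_ids, steps: int = 50, num_samples: int = 24000):
    """DDPM-style ancestral sampling: start from Gaussian noise and
    iteratively refine it into a waveform, conditioned on the text."""
    encoder, denoiser = TextEncoder(), WaveformDenoiser()
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    text_emb = encoder(text_ids)
    x = torch.randn(text_ids.shape[0], 1, num_samples)  # pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, text_emb, t)  # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # (batch, 1, num_samples) raw waveform


tokens = torch.randint(0, 256, (1, 16))  # dummy "text" input
audio = sample_waveform(tokens)
print(audio.shape)  # torch.Size([1, 1, 24000])
```

Because the denoiser sees the entire (fixed-length) waveform at every refinement step, the temporal layout of the speech is determined inside the diffusion process itself; this is what lets an approach like E3 TTS drop explicit alignment information.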