The paper presents a new text-to-speech (TTS) framework built around a neural transducer. It divides the TTS pipeline into two stages: semantic-level sequence-to-sequence modeling and fine-grained acoustic modeling. The transducer predicts discrete semantic tokens, obtained from wav2vec 2.0 embeddings, which makes alignment modeling robust and efficient; a non-autoregressive (NAR) speech generator then synthesizes waveforms from these semantic tokens. Experiments show the model surpasses the baseline in both speech quality and speaker similarity, highlighting the potential of neural transducers as a backbone for TTS frameworks.
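The two-stage split is concrete enough to sketch in code. Below is a minimal, hypothetical PyTorch sketch of the interface only: a transducer-style model that emits semantic tokens from text, followed by a NAR module that maps those tokens to a waveform. The class names (`SemanticTransducer`, `NARSpeechGenerator`), dimensions, greedy decoding loop, and toy convolutional upsampler are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class SemanticTransducer(nn.Module):
    """RNN-T-style stage 1 (illustrative): text encoder, prediction
    network, and joint network that emit discrete semantic tokens
    (plus a blank symbol) from text/phoneme IDs."""

    def __init__(self, n_text: int, n_sem: int, d: int = 256):
        super().__init__()
        self.blank = n_sem                              # reserve last index as blank
        self.text_emb = nn.Embedding(n_text, d)
        self.encoder = nn.LSTM(d, d, batch_first=True)
        self.sem_emb = nn.Embedding(n_sem + 1, d)       # +1 for the blank symbol
        self.predictor = nn.LSTM(d, d, batch_first=True)
        self.joint = nn.Linear(2 * d, n_sem + 1)

    @torch.no_grad()
    def greedy_decode(self, text: torch.Tensor, max_symbols: int = 3) -> list[int]:
        """Greedy transducer decoding over a single (1, T) text sequence."""
        enc, _ = self.encoder(self.text_emb(text))      # (1, T, d)
        tokens: list[int] = []
        y = torch.tensor([[self.blank]])                # predictor starts from blank
        state = None
        for t in range(enc.size(1)):
            for _ in range(max_symbols):                # cap emissions per text frame
                pred, new_state = self.predictor(self.sem_emb(y), state)
                logits = self.joint(torch.cat([enc[:, t], pred[:, -1]], dim=-1))
                k = int(logits.argmax(dim=-1))
                if k == self.blank:                     # blank: advance to next frame
                    break
                tokens.append(k)                        # non-blank: emit and update
                y = torch.tensor([[k]])
                state = new_state
        return tokens


class NARSpeechGenerator(nn.Module):
    """Stage 2 (illustrative): non-autoregressive mapping from semantic
    tokens to waveform samples via a toy conv upsampler."""

    def __init__(self, n_sem: int, d: int = 256, upsample: int = 320):
        super().__init__()
        self.emb = nn.Embedding(n_sem, d)
        self.net = nn.Conv1d(d, upsample, kernel_size=3, padding=1)

    def forward(self, sem_tokens: torch.Tensor) -> torch.Tensor:
        h = self.emb(sem_tokens).transpose(1, 2)        # (1, d, L)
        frames = self.net(h)                            # (1, upsample, L)
        return frames.transpose(1, 2).reshape(1, -1)    # flatten to (1, L * upsample)


if __name__ == "__main__":
    transducer = SemanticTransducer(n_text=100, n_sem=512)
    generator = NARSpeechGenerator(n_sem=512)
    text = torch.randint(0, 100, (1, 20))               # dummy phoneme IDs
    sem = transducer.greedy_decode(text)                # untrained, so output is arbitrary
    if sem:
        wav = generator(torch.tensor([sem]))
        print(wav.shape)                                # (1, len(sem) * 320)
```

The design point the sketch tries to capture is the one the paper argues for: the transducer handles the monotonic text-to-token alignment in stage 1, so stage 2 can be fully non-autoregressive and fast, since it only fills in fine-grained acoustics for an already-aligned token sequence.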
Publication date: 4 Jan 2024
Project Page: ?
Paper: https://arxiv.org/pdf/2401.01498