The paper presents a new text-to-speech (TTS) framework built around a neural transducer. It splits the TTS pipeline into two stages: semantic-level sequence-to-sequence modeling and fine-grained acoustic modeling. The first stage predicts discrete semantic tokens derived from wav2vec2.0 embeddings, which makes alignment modeling robust and efficient; a non-autoregressive (NAR) speech generator then synthesizes the waveform from these semantic tokens. Experiments show the model surpasses the baseline in both speech quality and speaker similarity, highlighting the potential of neural transducers in TTS frameworks.
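To make the two-stage design concrete, here is a minimal PyTorch-style sketch of the data flow. This is not the authors' implementation: the class names, dimensions, greedy transducer decode, and the toy NAR upsampler are all illustrative assumptions; a real system would train the transducer with the RNN-T loss and use a far more capable generator.

```python
import torch
import torch.nn as nn


class TokenTransducer(nn.Module):
    """Stage 1 (assumed shape): phoneme IDs -> discrete semantic tokens."""

    def __init__(self, n_phones=100, n_tokens=512, d=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d)
        self.encoder = nn.GRU(d, d, batch_first=True)      # text encoder
        self.token_emb = nn.Embedding(n_tokens + 1, d)     # +1 for blank
        self.predictor = nn.GRU(d, d, batch_first=True)    # prediction network
        self.joiner = nn.Linear(2 * d, n_tokens + 1)       # joint network

    @torch.no_grad()
    def greedy_decode(self, phones, max_tokens=400):
        blank = self.joiner.out_features - 1
        enc, _ = self.encoder(self.phone_emb(phones))      # (1, T, d)
        hyp, last, state, t = [], blank, None, 0
        while t < enc.size(1) and len(hyp) < max_tokens:
            prev = torch.tensor([[last]])
            pred, new_state = self.predictor(self.token_emb(prev), state)
            logits = self.joiner(torch.cat([enc[:, t], pred[:, 0]], dim=-1))
            k = int(logits.argmax(dim=-1))
            if k == blank:
                t += 1                 # blank: advance along the text axis
            else:
                hyp.append(k)          # non-blank: emit a semantic token
                last, state = k, new_state
        return torch.tensor(hyp, dtype=torch.long)


class NARSpeechGenerator(nn.Module):
    """Stage 2 (assumed shape): semantic tokens -> waveform, in parallel."""

    def __init__(self, n_tokens=512, d=256, samples_per_token=320):
        super().__init__()
        self.emb = nn.Embedding(n_tokens, d)
        self.net = nn.Conv1d(d, samples_per_token, kernel_size=3, padding=1)

    def forward(self, sem_tokens):                         # (L,)
        x = self.emb(sem_tokens).T.unsqueeze(0)            # (1, d, L)
        frames = self.net(x)                               # (1, S, L)
        return frames.transpose(1, 2).reshape(1, -1)       # (1, L * S)


phones = torch.randint(0, 100, (1, 20))                    # fake phoneme IDs
sem = TokenTransducer().greedy_decode(phones)              # stage 1
wav = NARSpeechGenerator()(sem) if sem.numel() else torch.zeros(1, 0)
print(sem.shape, wav.shape)
```

The key property the sketch illustrates is the division of labor: the transducer handles monotonic text-to-token alignment (emitting tokens or advancing via blanks), so the second stage can generate audio non-autoregressively without modeling alignment at all.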

Publication date: 4 Jan 2024
Project Page: ?
Paper: https://arxiv.org/pdf/2401.01498