The article introduces Incremental FastPitch, a variant of FastPitch that produces high-quality Mel chunks incrementally. Parallel models such as FastPitch offer faster synthesis and more controllability than conventional auto-regressive models, but their fully parallel architecture makes them a poor fit for incremental synthesis. The article also highlights the demand for Text-to-Speech (TTS) systems that produce speech incrementally, also known as streaming TTS, because they lower response latency and improve the user experience. The authors propose three changes: chunk-based FFT (feed-forward Transformer) blocks in the architecture, training with receptive-field constrained chunk attention masks, and inference with fixed-size past model states. Experimental results show that the proposed model produces speech quality comparable to the parallel FastPitch with significantly lower latency.
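
To make the idea of a receptive-field constrained chunk attention mask more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `chunk_attention_mask`, its parameters, and the choice to measure the visible past in whole chunks are assumptions made for illustration. Each Mel frame may attend to every frame in its own chunk and to a limited number of preceding chunks, and never to future chunks.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks):
    """Build a boolean attention mask for chunk-wise processing.

    mask[i][j] is True when query frame i may attend to key frame j.
    All names and the chunk-granular past window are illustrative
    assumptions, not taken from the paper's code.
    """
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            key_chunk = j // chunk_size
            # Visible keys: the query's own chunk plus up to `past_chunks`
            # chunks before it. Future chunks are always hidden.
            visible = query_chunk - past_chunks <= key_chunk <= query_chunk
            row.append(visible)
        mask.append(row)
    return mask


def render(mask):
    """Render the mask as text, 'x' for visible and '.' for hidden."""
    return ["".join("x" if v else "." for v in row) for row in mask]
```

Calling `render(chunk_attention_mask(6, 2, 1))` gives rows in which the first chunk sees only itself, while each later chunk sees itself and the one chunk before it. Bounding the visible past during training is what would let inference keep only a fixed-size window of past states, which is the property the summary describes.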

Publication date: 3 Jan 2024
Project Page: https://arxiv.org/abs/2401.01755
Paper: https://arxiv.org/pdf/2401.01755