This paper introduces DiffAR, a denoising diffusion autoregressive model for generating high-quality raw speech waveforms. The model generates overlapping frames sequentially, conditioning each frame on an overlapping portion of the previously generated one. This design supports speech of unlimited duration while maintaining high-fidelity synthesis and temporal coherence across frame boundaries. The model handles both unconditional and conditional speech generation, with the latter driven by an input sequence of phonemes, amplitudes, and pitch values. Working directly on the raw waveform lets the model capture local acoustic behaviors, such as vocal fry, which make the speech sound more natural. Because the model is stochastic, each inference run produces a slightly different waveform variation, adding to the richness of the output. The authors' experiments show that the model produces higher-quality speech than other state-of-the-art neural speech generation systems.
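
To make the frame-chaining concrete, below is a minimal sketch of the sequential overlapping-frame loop in Python. It is not the paper's implementation: the frame length, overlap size, step count, the placeholder `denoise` routine, and the inpainting-style way the overlap is held fixed are all assumptions for illustration.

```python
import numpy as np

FRAME_LEN = 2000   # samples per generated frame (assumed value)
OVERLAP = 500      # samples conditioned on from the previous frame (assumed)
N_STEPS = 50       # reverse-diffusion steps per frame (assumed)

def denoise(noisy_frame, prefix, step):
    """Placeholder for one reverse-diffusion step.

    A trained model would predict and remove noise conditioned on
    `prefix` (the tail of the previous frame) and, in the conditional
    setting, on phoneme, amplitude, and pitch inputs. Here we simply
    attenuate the signal so the loop runs end to end.
    """
    return 0.9 * noisy_frame

def generate_frame(prefix, rng):
    """Sample one frame: start from Gaussian noise, iteratively denoise."""
    x = rng.standard_normal(FRAME_LEN)
    for step in reversed(range(N_STEPS)):
        x = denoise(x, prefix, step)
        # Inpainting-style conditioning (an assumption of this sketch):
        # keep the known overlap region fixed at every step.
        x[:OVERLAP] = prefix
    return x

def generate_waveform(n_frames, seed=0):
    """Chain frames autoregressively; overlaps keep transitions coherent."""
    rng = np.random.default_rng(seed)
    prefix = np.zeros(OVERLAP)            # silence prefix for the first frame
    waveform = []
    for _ in range(n_frames):
        frame = generate_frame(prefix, rng)
        waveform.append(frame[OVERLAP:])  # append only the new samples
        prefix = frame[-OVERLAP:]         # tail conditions the next frame
    return np.concatenate(waveform)

audio = generate_waveform(n_frames=5)
print(audio.shape)  # (7500,) with the assumed sizes above
```

The point the sketch illustrates is that each frame contributes only its non-overlapping samples to the output, while its tail becomes the conditioning prefix for the next frame; this is what allows unbounded duration without losing coherence at the seams.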

Publication date: 2 Oct 2023
Abstract: https://arxiv.org/abs/2310.01381v1
Paper: https://arxiv.org/pdf/2310.01381