The paper introduces CoMoSVC, a consistency model-based Singing Voice Conversion (SVC) method that aims to deliver both high-quality generation and fast sampling. The authors first design a diffusion-based teacher model tailored to SVC, then distill a student model from it by enforcing the self-consistency property, which enables one-step sampling. Experiments show that CoMoSVC runs inference significantly faster than the state-of-the-art diffusion-based SVC system while delivering comparable or superior conversion performance. Audio samples and code are available online.
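The teacher-to-student step can be illustrated with a minimal sketch of generic consistency distillation, assuming CoMoSVC follows the standard consistency-model recipe (Song et al., 2023) with EDM-style conventions. The constants (`SIGMA_DATA`, `EPS`, `T_MAX`), the skip/output scaling functions, the single Euler ODE step, and the mean-squared distance are all assumptions taken from that generic recipe, not CoMoSVC's confirmed implementation. The names `network`, `teacher_denoiser`, and `cond` are hypothetical stand-ins for the student network, the diffusion teacher, and the conditioning features (for example content and pitch).

```python
import numpy as np

# Assumed hyperparameters from the generic consistency-model / EDM setup,
# not values confirmed for CoMoSVC.
SIGMA_DATA = 0.5
EPS = 0.002   # smallest noise level (start of the trajectory)
T_MAX = 80.0  # largest noise level used for one-step sampling


def c_skip(t):
    # Equals 1 at t == EPS, so the input passes straight through there.
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    # Equals 0 at t == EPS, which switches the network output off there.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)


def consistency_fn(network, x, t, cond):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t, cond).

    The parameterization guarantees the boundary condition f(x, EPS) = x.
    `network` (F) and `cond` are hypothetical placeholders.
    """
    return c_skip(t) * x + c_out(t) * network(x, t, cond)


def one_step_sample(network, cond, shape, rng):
    # Draw noise at the largest noise level and map it to data in one call.
    x_T = rng.standard_normal(shape) * T_MAX
    return consistency_fn(network, x_T, T_MAX, cond)


def distillation_loss(student, ema_student, teacher_denoiser,
                      x0, t_next, t_cur, cond, rng):
    """Self-consistency loss between adjacent noise levels t_cur < t_next."""
    # Perturb clean data to noise level t_next (EDM convention: sigma(t) = t).
    x_next = x0 + t_next * rng.standard_normal(x0.shape)
    # One Euler step of the teacher's probability-flow ODE dx/dt = (x - D(x, t)) / t.
    d = (x_next - teacher_denoiser(x_next, t_next, cond)) / t_next
    x_cur = x_next + (t_cur - t_next) * d
    # Both points lie on the same teacher trajectory, so the student should
    # map them to the same output. In training, `target` would be treated as a
    # constant (no gradient); NumPy has no autograd, so that is only noted here.
    pred = consistency_fn(student, x_next, t_next, cond)
    target = consistency_fn(ema_student, x_cur, t_cur, cond)
    return float(np.mean((pred - target) ** 2))
```

For example, `one_step_sample(lambda x, t, c: np.tanh(x), None, (4, 8), np.random.default_rng(0))` returns a `(4, 8)` array after a single network evaluation. The same call pattern, with a trained student, is what makes one-step inference possible. A real diffusion sampler would instead loop over many noise levels.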

Publication date: 3 Jan 2024
Project Page: https://comosvc.github.io/
Paper: https://arxiv.org/pdf/2401.01792