This paper describes the T02 team’s system for the Singing Voice Conversion Challenge 2023 (SVCC2023). The system is a VITS-based SVC model made up of three modules: a feature extractor, a voice converter, and a post-processor. The feature extractor provides F0 contours and uses a HuBERT model to extract speaker-independent linguistic content from the input singing voice. The voice converter recombines speaker timbre, F0, and linguistic content to generate the target speaker's waveform. As the post-processor, a fine-tuned DSPGAN vocoder resynthesises the waveform to improve audio quality. The authors report superior performance in the challenge, especially in the cross-domain task.
Publication date: 10 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.05118
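
The three-stage data flow described in the summary (feature extraction, voice conversion, post-processing) can be sketched as below. This is only an illustrative skeleton, not the authors' implementation: every function, the `Features` type, and the constants `FRAME` and `SAMPLE_RATE` are hypothetical stand-ins, and the bodies are trivial placeholders where the real system would run an F0 tracker, HuBERT, the VITS-based converter, and the DSPGAN vocoder.

```python
import math
from dataclasses import dataclass
from typing import List

FRAME = 4            # hypothetical hop size in samples (illustration only)
SAMPLE_RATE = 16000  # hypothetical sample rate (illustration only)


@dataclass
class Features:
    f0: List[float]       # one pitch value (Hz) per frame
    content: List[float]  # stand-in for speaker-independent content features


def extract_features(wave: List[float]) -> Features:
    """Stand-in for the feature extractor (F0 contour plus HuBERT content)."""
    n_frames = len(wave) // FRAME
    frames = [wave[i * FRAME:(i + 1) * FRAME] for i in range(n_frames)]
    f0 = [220.0] * n_frames  # placeholder: constant pitch instead of a real tracker
    # Placeholder "content": mean absolute amplitude of each frame.
    content = [sum(abs(s) for s in frame) / FRAME for frame in frames]
    return Features(f0=f0, content=content)


def convert_voice(feats: Features, speaker_id: int) -> List[float]:
    """Stand-in for the voice converter: recombine timbre, F0, and content."""
    timbre = 0.1 * (speaker_id + 1)  # placeholder for a learned speaker embedding
    wave: List[float] = []
    for i, (f0, c) in enumerate(zip(feats.f0, feats.content)):
        for k in range(FRAME):
            t = (i * FRAME + k) / SAMPLE_RATE
            wave.append(timbre * c * math.sin(2.0 * math.pi * f0 * t))
    return wave


def post_process(wave: List[float]) -> List[float]:
    """Stand-in for vocoder resynthesis: here it only clips to [-1, 1]."""
    return [max(-1.0, min(1.0, s)) for s in wave]


def run_pipeline(wave: List[float], speaker_id: int) -> List[float]:
    """Chain the three modules in the order the summary describes."""
    feats = extract_features(wave)
    converted = convert_voice(feats, speaker_id)
    return post_process(converted)
```

The point of the sketch is the interface between stages: the extractor emits frame-level F0 and content, the converter consumes those plus a target speaker identity and emits a waveform, and the post-processor maps waveform to waveform.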