The article discusses the use of normalizing flows in voice conversion (VC) and introduces a new training paradigm called AutoEncoder Normalizing Flow (AE-Flow). Normalizing flows are unsupervised generative models that have shown promising results in text-to-speech and VC. AE-Flow introduces supervision into the training process without requiring parallel data: it adds a reconstruction loss that forces the model to use the conditioning information when reconstructing an audio sample. The study compares AE-Flow with models trained under different loss functions and finds that AE-Flow systematically improves speaker similarity and naturalness.
Publication date: 29 Dec 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2312.16552
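To make the core idea concrete, here is a minimal sketch of a combined objective: the standard flow negative log-likelihood plus a reconstruction term computed by inverting the flow. This is an illustrative toy (a single 1D affine flow whose scale and shift stand in for conditioning-derived parameters), not the paper's actual architecture; all function names and the weighting parameter `lam` are hypothetical.

```python
import math

def affine_flow_forward(x, scale, shift):
    # Toy 1D affine flow: z = (x - shift) / scale.
    # Log-determinant of dz/dx is -log(scale).
    z = (x - shift) / scale
    log_det = -math.log(scale)
    return z, log_det

def nll(z, log_det):
    # Negative log-likelihood under a standard normal base distribution.
    log_pz = -0.5 * (z * z + math.log(2.0 * math.pi))
    return -(log_pz + log_det)

def ae_flow_style_loss(x, cond_scale, cond_shift, lam=1.0):
    # Hypothetical AE-Flow-style objective: flow NLL plus a reconstruction
    # penalty obtained by running the flow inverse (the "decoder" direction).
    # cond_scale / cond_shift stand in for parameters predicted from
    # conditioning (e.g. a speaker embedding); lam is an assumed weight.
    z, log_det = affine_flow_forward(x, cond_scale, cond_shift)
    x_hat = z * cond_scale + cond_shift  # exact inverse of the forward pass
    recon = (x - x_hat) ** 2
    return nll(z, log_det) + lam * recon
```

Because this toy flow is exactly invertible, the reconstruction term is zero here; in a real model the inverse pass is driven by the conditioning, so the reconstruction loss penalizes ignoring that conditioning.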