The article discusses the challenge of audio-to-audio (A2A) style transfer, focusing on the transfer of emotional content, and presents the Zero-Shot Emotion Style Transfer (ZEST) model as a solution. ZEST transfers the emotional content of a target (reference) audio onto a source audio while preserving the speaker identity and speech content of the source. The model is trained with only a self-supervision-based reconstruction loss, and it performs emotion transfer without requiring parallel training data or labels for the source or target audio. The article demonstrates ZEST's effectiveness through objective and subjective quality evaluations.
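The summary only sketches the training recipe at a high level. Below is a minimal, heavily simplified illustration of the general idea of a disentangle-and-reconstruct objective of this kind: separate encoders extract content, speaker, and emotion representations, a decoder reconstructs the input from them (self-supervised, no labels), and at inference the emotion embedding is swapped for one taken from a reference clip. All module names, dimensions, and the mel-spectrogram-style input are hypothetical placeholders and do not reproduce ZEST's actual components or architecture; see the paper for those.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UtteranceEncoder(nn.Module):
    """Maps a (batch, feat, time) sequence to a single utterance-level embedding."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, x):
        h = F.relu(self.conv(x))           # (B, out_dim, T)
        return h.mean(dim=-1)              # temporal pooling -> (B, out_dim)


class ContentEncoder(nn.Module):
    """Keeps the time axis so the decoder can reconstruct the full sequence."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))        # (B, out_dim, T)


class Decoder(nn.Module):
    """Reconstructs input features from content + speaker + emotion factors."""
    def __init__(self, content_dim, spk_dim, emo_dim, out_dim):
        super().__init__()
        self.proj = nn.Conv1d(content_dim + spk_dim + emo_dim, out_dim,
                              kernel_size=3, padding=1)

    def forward(self, content, spk, emo):
        T = content.shape[-1]
        # Broadcast the utterance-level speaker/emotion embeddings over time.
        cond = torch.cat([content,
                          spk.unsqueeze(-1).expand(-1, -1, T),
                          emo.unsqueeze(-1).expand(-1, -1, T)], dim=1)
        return self.proj(cond)


feat_dim = 80                               # e.g. mel bins (assumed, illustrative)
content_enc = ContentEncoder(feat_dim, 128)
speaker_enc = UtteranceEncoder(feat_dim, 64)
emotion_enc = UtteranceEncoder(feat_dim, 64)
decoder = Decoder(128, 64, 64, feat_dim)

params = (list(content_enc.parameters()) + list(speaker_enc.parameters())
          + list(emotion_enc.parameters()) + list(decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

# Training step: reconstruct each utterance from its own factors (no labels,
# no parallel data) -- a self-supervised reconstruction loss.
x = torch.randn(4, feat_dim, 200)           # stand-in batch of speech features
recon = decoder(content_enc(x), speaker_enc(x), emotion_enc(x))
loss = F.l1_loss(recon, x)
opt.zero_grad()
loss.backward()
opt.step()

# Inference: keep content and speaker from the source, take the emotion
# embedding from a reference utterance that carries the desired emotion.
source = torch.randn(1, feat_dim, 200)
reference = torch.randn(1, feat_dim, 180)
converted = decoder(content_enc(source),    # speech content from source
                    speaker_enc(source),    # speaker identity from source
                    emotion_enc(reference)) # emotion from reference
print(converted.shape)                      # (1, 80, 200)
```

The point mirrored here is that training only ever asks the decoder to reproduce its own input, so no emotion labels or parallel recordings are needed; the emotion transfer happens purely at inference time by swapping which utterance supplies the emotion embedding.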

Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.04511