The article discusses the challenge of audio-to-audio (A2A) style transfer, focusing on the transfer of emotional content, and presents the Zero-Shot Emotion Style Transfer (ZEST) model as a solution. ZEST transfers the emotional content of a target (reference) audio onto a source audio while preserving the speaker identity and speech content of the source. The model is trained with only a self-supervision-based reconstruction loss, and it performs emotion transfer without requiring parallel training data or labels for the source or target audio. The article demonstrates ZEST's effectiveness through objective and subjective quality evaluations.
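The summary only sketches the training recipe at a high level. Below is a minimal, heavily simplified illustration of the general idea of a disentangle-and-reconstruct objective of this kind: separate encoders extract content, speaker, and emotion representations, a decoder reconstructs the input from them (self-supervised, no labels), and at inference the emotion embedding is swapped for one taken from a reference clip. All module names, dimensions, and the mel-spectrogram-style input are hypothetical placeholders and do not reproduce ZEST's actual components or architecture; see the paper for those.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UtteranceEncoder(nn.Module):
    """Maps a (batch, feat, time) sequence to a single utterance-level embedding."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, x):
        h = F.relu(self.conv(x))           # (B, out_dim, T)
        return h.mean(dim=-1)              # temporal pooling -> (B, out_dim)


class ContentEncoder(nn.Module):
    """Keeps the time axis so the decoder can reconstruct the full sequence."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))        # (B, out_dim, T)


class Decoder(nn.Module):
    """Reconstructs input features from content + speaker + emotion factors."""
    def __init__(self, content_dim, spk_dim, emo_dim, out_dim):
        super().__init__()
        self.proj = nn.Conv1d(content_dim + spk_dim + emo_dim, out_dim,
                              kernel_size=3, padding=1)

    def forward(self, content, spk, emo):
        T = content.shape[-1]
        # Broadcast the utterance-level speaker/emotion embeddings over time.
        cond = torch.cat([content,
                          spk.unsqueeze(-1).expand(-1, -1, T),
                          emo.unsqueeze(-1).expand(-1, -1, T)], dim=1)
        return self.proj(cond)


feat_dim = 80                               # e.g. mel bins (assumed, illustrative)
content_enc = ContentEncoder(feat_dim, 128)
speaker_enc = UtteranceEncoder(feat_dim, 64)
emotion_enc = UtteranceEncoder(feat_dim, 64)
decoder = Decoder(128, 64, 64, feat_dim)

params = (list(content_enc.parameters()) + list(speaker_enc.parameters())
          + list(emotion_enc.parameters()) + list(decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

# Training step: reconstruct each utterance from its own factors (no labels,
# no parallel data) -- a self-supervised reconstruction loss.
x = torch.randn(4, feat_dim, 200)           # stand-in batch of speech features
recon = decoder(content_enc(x), speaker_enc(x), emotion_enc(x))
loss = F.l1_loss(recon, x)
opt.zero_grad()
loss.backward()
opt.step()

# Inference: keep content and speaker from the source, take the emotion
# embedding from a reference utterance that carries the desired emotion.
source = torch.randn(1, feat_dim, 200)
reference = torch.randn(1, feat_dim, 180)
converted = decoder(content_enc(source),    # speech content from source
                    speaker_enc(source),    # speaker identity from source
                    emotion_enc(reference)) # emotion from reference
print(converted.shape)                      # (1, 80, 200)
```

The point mirrored here is that training only ever asks the decoder to reproduce its own input, so no emotion labels or parallel recordings are needed; the emotion transfer happens purely at inference time by swapping which utterance supplies the emotion embedding.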

Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.04511