Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

This paper proposes a way to improve zero-shot multi-speaker text-to-speech (TTS) systems, which aim to reproduce a chosen speaker’s voice without additional fine-tuning. Such systems often adapt poorly to unseen speakers, particularly in out-of-domain settings. To address this, the researchers propose a negation feature learning paradigm that models speaker attributes as deviations from the complete audio representation, which removes extraneous content information and improves speaker fidelity. Multi-stream Transformers and attention pooling are used to learn diverse speaker attributes, and adaptive layer normalization fuses the resulting speaker representation with the target text representation. The researchers report that the method is effective at preserving and exploiting speaker-specific attributes.
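
As a rough illustration of three of the ideas summarized above (speaker attributes as a deviation from the full audio representation, attention pooling, and adaptive layer normalization), here is a minimal pure-Python sketch. It is not the paper's implementation. The plain subtraction used for the negation step, the single query vector used for pooling, the linear maps `w_gamma` and `w_beta`, and all function names are assumptions made for illustration only.

```python
import math

def negate(full, content):
    """Frame-wise residual: full audio representation minus a content estimate.
    ASSUMPTION: plain subtraction stands in for the paper's negation step."""
    return [[f - c for f, c in zip(f_row, c_row)]
            for f_row, c_row in zip(full, content)]

def attention_pool(frames, query):
    """Collapse T frames into one vector with softmax-weighted averaging.
    ASSUMPTION: scores are dot products with one (normally learned) query."""
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in frames]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]

def adaptive_layer_norm(hidden, speaker, w_gamma, w_beta, eps=1e-5):
    """Layer-normalize `hidden`, then scale and shift it with parameters
    predicted from the speaker vector.
    ASSUMPTION: gamma = 1 + W_gamma @ speaker and beta = W_beta @ speaker,
    so all-zero weights reduce to ordinary layer normalization."""
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    gamma = [1.0 + sum(w * s for w, s in zip(row, speaker)) for row in w_gamma]
    beta = [sum(w * s for w, s in zip(row, speaker)) for row in w_beta]
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

# Toy data: two frames of a 2-dimensional representation (made-up numbers).
full = [[1.0, 2.0], [3.0, 1.0]]
content = [[0.5, 1.0], [2.0, 1.0]]

residual = negate(full, content)                       # speaker-related remainder
speaker = attention_pool(residual, query=[0.0, 0.0])   # zero query: uniform weights
zeros = [[0.0, 0.0], [0.0, 0.0]]
text_hidden = adaptive_layer_norm([1.0, 3.0], speaker, zeros, zeros)
```

With a zero query the pooling weights are uniform, so `speaker` is simply the mean of the residual frames; in a trained model the query and the `w_gamma` / `w_beta` maps would be learned, and the text-side hidden states would be conditioned on the pooled speaker vector at each normalization layer.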

Publication date: 5 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.02014
