The article presents an approach to improving zero-shot multi-speaker text-to-speech (TTS) systems, which aim to replicate a chosen speaker’s voice without additional fine-tuning. Such systems often struggle to adapt to new speakers, particularly in out-of-domain settings. To address this, the researchers propose a negation feature learning paradigm: speaker attributes are modeled as deviations from the complete audio representation, which removes extraneous content information and improves speaker fidelity. Multi-stream Transformers with attention pooling are used to learn diverse speaker attributes, and adaptive layer normalization fuses the resulting speaker representations with the target text representations. The researchers report that the method is effective at preserving and exploiting speaker-specific attributes.
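The three mechanisms named in the summary can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. All function names, shapes, and weights are hypothetical, and treating the "deviation" as a plain element-wise subtraction is an assumption made here for illustration.

```python
import math

def attention_pool(frames, query):
    """Collapse a sequence of frame vectors into one vector.

    Each frame is scored by its dot product with `query` (a stand-in for a
    learned vector), the scores go through a softmax, and the frames are
    averaged with those weights.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]

def negate_content(full_repr, content_repr):
    """Assumed form of negation: speaker = complete audio - content."""
    return [f - c for f, c in zip(full_repr, content_repr)]

def adaptive_layer_norm(hidden, speaker, scale_weights, shift_weights, eps=1e-5):
    """Layer-normalize `hidden`, then scale and shift it with values
    predicted from the speaker vector by two linear maps (no bias)."""
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    scale = [sum(w * s for w, s in zip(row, speaker)) for row in scale_weights]
    shift = [sum(w * s for w, s in zip(row, speaker)) for row in shift_weights]
    return [g * n + b for g, n, b in zip(scale, normed, shift)]

# Toy usage: three audio frames of dimension 2 (made-up numbers).
frames = [[0.2, 1.0], [0.4, 0.8], [0.9, 0.1]]
full_repr = attention_pool(frames, query=[1.0, 0.0])
content_repr = [0.3, 0.5]                      # hypothetical content vector
speaker = negate_content(full_repr, content_repr)

# Condition a text-side hidden state (dimension 3) on the speaker vector.
text_hidden = [1.0, 2.0, 3.0]
scale_w = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # hypothetical 3x2 projections
shift_w = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
fused = adaptive_layer_norm(text_hidden, speaker, scale_w, shift_w)
```

In a trained model the query vector and the scale and shift projections would be learned parameters, and a multi-stream variant would run several pooling heads in parallel to capture different speaker attributes. Here they are fixed constants so the data flow stays visible.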
Publication date: 5 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.02014