The article discusses the design and functionality of PromptSpeaker, a system that generates custom speaker voices from text prompts. PromptSpeaker consists of three components: a prompt encoder, a zero-shot VITS, and a Glow model. The prompt encoder predicts a prior distribution from the text description and samples from it to obtain a semantic representation. The Glow model converts this semantic representation into a speaker representation, and the zero-shot VITS synthesizes the speaker's voice from that representation. The authors verify that PromptSpeaker can generate new speakers not included in the training set and that the synthetic speaker voice matches the speaker prompt reasonably well.
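
The data flow described above (text description, then prior distribution, then sampled semantic representation, then speaker representation) can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the random weights, the diagonal-Gaussian form of the prior, and every function name here are assumptions made for illustration, a single affine coupling layer stands in for the full Glow model, and the zero-shot VITS synthesis step is left out entirely.

```python
import numpy as np

# Hypothetical sizes, chosen only for this sketch.
TEXT_DIM, SEM_DIM = 16, 8
HALF = SEM_DIM // 2

_init = np.random.default_rng(0)
# Stand-in weights for the prompt encoder (assumed linear here).
W_mu = _init.normal(scale=0.1, size=(TEXT_DIM, SEM_DIM))
W_logvar = _init.normal(scale=0.1, size=(TEXT_DIM, SEM_DIM))
# Stand-in weights for one affine coupling layer (a Glow building block).
W_scale = _init.normal(scale=0.1, size=(HALF, HALF))
W_shift = _init.normal(scale=0.1, size=(HALF, HALF))


def prompt_encoder(text_features):
    """Predict the mean and log-variance of an assumed diagonal Gaussian prior."""
    return text_features @ W_mu, text_features @ W_logvar


def sample_semantic(mu, logvar, rng):
    """Draw a semantic representation from the predicted prior."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def coupling_forward(z):
    """Map a semantic vector to a speaker vector with one invertible layer."""
    z_a, z_b = z[:HALF], z[HALF:]
    log_s = np.tanh(z_a @ W_scale)   # bounded log-scale keeps the map stable
    shift = z_a @ W_shift
    return np.concatenate([z_a, z_b * np.exp(log_s) + shift])


def coupling_inverse(y):
    """Exact inverse of coupling_forward (speaker vector back to semantic)."""
    y_a, y_b = y[:HALF], y[HALF:]
    log_s = np.tanh(y_a @ W_scale)
    shift = y_a @ W_shift
    return np.concatenate([y_a, (y_b - shift) * np.exp(-log_s)])


def generate_speaker_embedding(text_features, rng):
    """Text features -> prior -> sampled semantic vector -> speaker vector."""
    mu, logvar = prompt_encoder(text_features)
    z = sample_semantic(mu, logvar, rng)
    return coupling_forward(z)


# Usage: a placeholder vector stands in for an encoded text description.
features = np.random.default_rng(1).standard_normal(TEXT_DIM)
speaker_a = generate_speaker_embedding(features, np.random.default_rng(2))
speaker_b = generate_speaker_embedding(features, np.random.default_rng(3))
print(speaker_a.shape, np.allclose(speaker_a, speaker_b))
```

Because the semantic representation is sampled rather than computed deterministically, the same description can yield different speaker vectors on different draws, which is one plausible reading of how such a design produces speakers outside the training set.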

Publication date: 10 Oct 2023
Project Page: https://promptspeaker.github.io/demo/
Paper: https://arxiv.org/pdf/2310.05001