The article addresses speech generation from discrete audio tokens derived from self-supervised learning models. It argues that the current practice of feeding these tokens directly to a sequence model complicates sequence modeling and leaves the model to discover the correlations between tokens on its own. The authors propose ‘acoustic BPE’, which applies byte-pair encoding to merge frequent audio token patterns into single units, shortening the sequences and exploiting the morphological information they contain. Reported advantages include faster inference and better capture of syntax. The authors also propose a novel rescoring method for selecting the best synthetic speech among multiple candidates.
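To make the idea of merging frequent audio token patterns concrete, here is a minimal sketch of standard byte-pair encoding applied to integer token sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names (`learn_acoustic_bpe`, `apply_merge`, `encode`), the `first_new_id` parameter, and the toy token values are all hypothetical, and the summary does not specify the paper's preprocessing or vocabulary size.

```python
from collections import Counter


def apply_merge(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_acoustic_bpe(sequences, num_merges, first_new_id):
    """Learn up to `num_merges` merges over sequences of integer audio tokens.

    Hypothetical helper: `first_new_id` is assumed to be larger than any
    base token id, so merged units never collide with the originals.
    """
    seqs = [list(s) for s in sequences]
    merges = []  # list of ((left, right), new_id), in learning order
    next_id = first_new_id
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for s in seqs:
            pair_counts.update(zip(s, s[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is worth merging
        merges.append((best, next_id))
        seqs = [apply_merge(s, best, next_id) for s in seqs]
        next_id += 1
    return merges


def encode(seq, merges):
    """Apply the learned merges, in order, to a new token sequence."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = apply_merge(seq, pair, new_id)
    return seq


# Toy corpus of made-up audio token ids.
corpus = [[3, 7, 7, 3, 7, 9], [3, 7, 9, 1]]
merges = learn_acoustic_bpe(corpus, num_merges=5, first_new_id=100)
print(merges)                           # [((3, 7), 100), ((100, 9), 101)]
print(encode([3, 7, 9, 3, 7], merges))  # [101, 100]
```

In this toy run the five-token input shrinks to two units, which is the mechanism behind the shorter sequences and faster inference described above.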
Publication date: 25 Oct 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2310.14580