Acoustic BPE for Speech Generation with Discrete Tokens

This paper addresses a weakness in speech generation with discrete audio tokens derived from self-supervised learning (SSL) models. According to the authors, the current practice of feeding those tokens directly to a sequence model complicates sequence modeling, because the model itself must learn the correlations between tokens. They propose ‘acoustic BPE’, which applies byte-pair encoding to encode frequently occurring audio token patterns. This shortens the sequences and exploits the morphological information present in them. The reported advantages include faster inference and a better ability to capture syntax. The authors also propose a novel rescore method that selects the best synthetic speech among multiple candidates.
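To make the core idea concrete, here is a minimal sketch of byte-pair encoding applied to sequences of integer audio token IDs. It is not the authors' implementation: the function names (`learn_bpe`, `encode`, `decode`), the toy corpus, and the assumption that base tokens are IDs in the range 0 to 7 are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def count_pairs(sequences):
    """Count adjacent token pairs across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, base_vocab_size, num_merges):
    """Greedily learn up to `num_merges` merges; return (merges, encoded sequences)."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(seqs)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:  # nothing left that repeats
            break
        new_id = base_vocab_size + len(merges)  # new IDs follow the base vocabulary
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs

def encode(seq, merges):
    """Apply learned merges, in the order they were learned, to a new sequence."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq

def decode(seq, merges):
    """Expand merged IDs back into the original base tokens."""
    expand = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in expand:
            left, right = expand[tok]
            stack.extend([right, left])  # left is popped first
        else:
            out.append(tok)
    return out

# Toy corpus (hypothetical): two short sequences of base token IDs in 0..7.
corpus = [[3, 7, 7, 1, 3, 7], [3, 7, 1, 3, 7, 7]]
merges, encoded = learn_bpe(corpus, base_vocab_size=8, num_merges=2)
```

On this toy corpus the most frequent pair, `(3, 7)`, becomes the first merged unit, and the encoded sequences are shorter than the originals while remaining exactly recoverable through `decode`. A model trained on the encoded sequences therefore has fewer steps to generate, which is the intuition behind the faster inference mentioned above.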

Publication date: 25 Oct 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2310.14580
