The paper presents UniAudio, a system that applies language-model techniques to generate multiple types of audio, including speech, sounds, music, and singing, conditioned on given inputs. UniAudio tokenizes all target audio with a neural codec, concatenates each source-target pair into a single token sequence, and trains a language model to perform next-token prediction over it. Because the residual vector quantization (RVQ) used in the codec emits several tokens per audio frame, the resulting sequences are long, so a multi-scale Transformer is used to handle them efficiently (see the sketch below). The model is trained jointly on 11 audio generation tasks, demonstrates strong performance across all of them, and can support new audio generation tasks after simple fine-tuning.
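To make the multi-scale idea concrete, here is a minimal sketch of the global/local decomposition it describes: a "global" Transformer attends across frames over summed codebook embeddings, while a small "local" Transformer predicts the nq RVQ codes inside each frame, seeded by that frame's global hidden state. This is an illustrative reconstruction under assumed hyperparameters (vocab size, nq, layer counts, module names are all placeholders), not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiScaleAudioLM(nn.Module):
    """Illustrative two-level (global/local) decoder over RVQ codec tokens."""
    def __init__(self, vocab_size=1024, nq=3, d_model=512, n_global=6, n_local=2):
        super().__init__()
        self.nq = nq
        self.embed = nn.Embedding(vocab_size, d_model)  # shared codebook-token embedding
        make_tf = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), n)
        self.global_tf = make_tf(n_global)  # attends across frames (length T)
        self.local_tf = make_tf(n_local)    # attends across the nq codes of one frame
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, codes):
        # codes: (B, T, nq) integer RVQ codes, nq codebook entries per audio frame
        B, T, nq = codes.shape
        # Global stage: one summed embedding per frame, so attention cost grows
        # with T rather than T * nq.
        frame_emb = self.embed(codes).sum(dim=2)                    # (B, T, d)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.global_tf(frame_emb, mask=causal)                  # (B, T, d)
        # Local stage: predict the nq codes of each frame autoregressively,
        # conditioned on that frame's global hidden state.
        local_in = torch.cat(
            [h.unsqueeze(2), self.embed(codes[:, :, :-1])], dim=2)  # (B, T, nq, d)
        local_in = local_in.reshape(B * T, nq, -1)
        local_mask = nn.Transformer.generate_square_subsequent_mask(nq)
        out = self.local_tf(local_in, mask=local_mask)              # (B*T, nq, d)
        return self.head(out).reshape(B, T, nq, -1)                 # logits per code

codes = torch.randint(0, 1024, (2, 50, 3))  # 2 clips, 50 frames, 3 codebooks
logits = MultiScaleAudioLM()(codes)
print(logits.shape)  # torch.Size([2, 50, 3, 1024])
```

The point of the split is that a flat sequence of RVQ tokens has length T * nq, while here the expensive full-context attention only ever sees T frame positions; the per-frame codes are handled by a much cheaper local model.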
Publication date: 1 Oct 2023
Project Page: https://github.com/yangdongchao/UniAudio
Paper: https://arxiv.org/pdf/2310.00704