The paper presents UniAudio, a system that applies language-modeling techniques to generate multiple types of audio, including speech, sounds, music, and singing, conditioned on given inputs. The system tokenizes all target audio, concatenates each source-target pair into a single sequence, and performs next-token prediction with a language model. Because the residual vector quantization (RVQ)-based neural codec used for tokenization produces very long token sequences, a multi-scale Transformer is used to handle them. Trained across a variety of generative tasks, the model demonstrates strong performance on all of them, and it can support new audio generation tasks after simple fine-tuning.
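To make the sequence-construction idea concrete, here is a minimal PyTorch sketch, not UniAudio's actual code: discrete source tokens (e.g., a text or codec prompt) and target audio tokens are concatenated into one sequence, and a tiny decoder-only Transformer is trained with a next-token prediction loss. The vocabulary size, model dimensions, and the `TinyAudioLM` class are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real UniAudio vocabulary and dimensions differ.
VOCAB_SIZE = 1024   # codec codebook entries plus special/task tokens (assumed)
D_MODEL = 256
N_HEADS = 4
N_LAYERS = 2

class TinyAudioLM(nn.Module):
    """Minimal decoder-only LM over discrete tokens.

    An illustrative stand-in for UniAudio's multi-scale Transformer,
    not the actual architecture.
    """

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        # Causal mask so each position attends only to earlier tokens.
        seq_len = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.embed(tokens)
        x = self.blocks(x, mask=mask)
        return self.head(x)

# Source tokens followed by target audio tokens, concatenated into a
# single sequence as described in the summary above (random stand-ins).
source = torch.randint(0, VOCAB_SIZE, (1, 8))
target = torch.randint(0, VOCAB_SIZE, (1, 16))
sequence = torch.cat([source, target], dim=1)

model = TinyAudioLM()
logits = model(sequence[:, :-1])   # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), sequence[:, 1:].reshape(-1)
)
print(f"next-token prediction loss: {loss.item():.3f}")
```

In UniAudio itself, the multi-scale design addresses the length problem this flat setup would hit: RVQ emits several codes per audio frame, and the model processes frames globally while handling the codes within each frame locally, rather than flattening everything into one long sequence as done here.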

Publication date: 1 Oct 2023
Project Page: https://github.com/yangdongchao/UniAudio
Paper: https://arxiv.org/pdf/2310.00704