The paper introduces Audiobox, a unified model for generating both speech and sound with enhanced controllability. For speech, it allows independent control over the transcript, vocal style, and other audio attributes. To generalize well with limited labeled data, the model is pre-trained on large amounts of unlabeled audio using a self-supervised infilling objective. Audiobox sets new benchmarks in speech and sound generation and offers new methods for producing audio with novel vocal and acoustic styles. It also incorporates Bespoke Solvers, which accelerate generation without compromising performance on the evaluated tasks.
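The core idea of self-supervised infilling can be illustrated with a minimal sketch: hide a contiguous span of an unlabeled audio feature sequence and train the model to reconstruct it from the surrounding context. This is a conceptual illustration only (function names, the zero-fill masking, and the plain MSE loss are assumptions for exposition); Audiobox's actual objective couples this kind of masking with its generative training loss.

```python
import numpy as np

def infilling_batch(features, mask_ratio=0.3, rng=None):
    """Build a self-supervised infilling example from an unlabeled
    feature sequence of shape (T, D): hide one contiguous span and
    return a boolean mask marking which frames must be reconstructed.
    (Illustrative sketch, not the paper's exact procedure.)"""
    rng = rng or np.random.default_rng()
    T = features.shape[0]
    span = max(1, int(T * mask_ratio))
    start = int(rng.integers(0, T - span + 1))
    context = features.copy()
    context[start:start + span] = 0.0           # hidden (masked) frames
    loss_mask = np.zeros(T, dtype=bool)
    loss_mask[start:start + span] = True        # loss computed here only
    return context, loss_mask

def masked_reconstruction_loss(pred, target, loss_mask):
    """Mean squared error restricted to the masked span."""
    diff = (pred - target)[loss_mask]
    return float(np.mean(diff ** 2))
```

Because the target is the audio itself, no labels are needed, which is what lets the model pre-train on large unlabeled corpora before fine-tuning on the smaller labeled sets.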
Publication date: 25 Dec 2023
Project Page: https://audiobox.metademolab.com/
Paper: https://arxiv.org/pdf/2312.15821