The paper introduces MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is a single-stage, non-autoregressive transformer. During training, it predicts spans of masked tokens obtained from a masking scheduler; during inference, it gradually constructs the output sequence over several decoding steps. The authors also introduce a novel rescoring method to enhance the quality of the generated audio. They demonstrate the efficiency of MAGNeT for text-to-music and text-to-audio generation. The proposed approach is comparable in quality to the evaluated baselines while being significantly faster.
Publication date: 2024-01-09
Project Page: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT
Paper: https://arxiv.org/pdf/2401.04577
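
To make the summary's decoding description concrete, here is a minimal sketch of a generic iterative masked-decoding loop: start from a fully masked sequence, predict all positions in parallel, keep the most confident predictions, and re-mask the rest for the next step. This is an illustration of the general technique, not the authors' implementation; all names (`toy_model`, `cosine_schedule`, the constants) are hypothetical, and a real MAGNeT model operates over several codebook streams and applies an external-model rescoring step not shown here.

```python
# Sketch of iterative masked decoding (illustrative only; not MAGNeT's code).
import math
import numpy as np

MASK_ID = -1          # placeholder id for masked positions
VOCAB_SIZE = 1024     # toy codebook size
SEQ_LEN = 64          # toy sequence length
NUM_STEPS = 8         # number of decoding iterations

rng = np.random.default_rng(0)

def toy_model(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the non-autoregressive transformer: returns
    per-position logits over the codebook (random here, for illustration)."""
    return rng.normal(size=(len(tokens), VOCAB_SIZE))

def cosine_schedule(step: int, total: int) -> float:
    """Fraction of positions that should remain masked after this step."""
    return math.cos(0.5 * math.pi * (step + 1) / total)

def iterative_masked_decode() -> np.ndarray:
    tokens = np.full(SEQ_LEN, MASK_ID, dtype=np.int64)   # start fully masked
    for step in range(NUM_STEPS):
        logits = toy_model(tokens)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        predictions = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)

        # Only currently masked positions are candidates for filling in;
        # already-fixed positions get infinite confidence so they stay fixed.
        masked = tokens == MASK_ID
        confidence = np.where(masked, confidence, np.inf)

        # Keep the least-confident positions masked for the next iteration.
        num_to_remask = int(cosine_schedule(step, NUM_STEPS) * SEQ_LEN)
        order = np.argsort(confidence)            # least confident first
        remask = order[:num_to_remask]

        tokens = np.where(masked, predictions, tokens)
        tokens[remask] = MASK_ID
    return tokens

print(iterative_masked_decode()[:10])
```

Because every decoding step fills many positions in parallel, the number of model calls is fixed by `NUM_STEPS` rather than by the sequence length, which is the source of the speedup over autoregressive decoding highlighted in the summary.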