The paper introduces MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike previous work, MAGNeT is a single-stage, non-autoregressive transformer. During training, it predicts spans of masked tokens obtained from a masking scheduler; during inference, it gradually constructs the output sequence over several decoding steps. The authors also introduce a novel rescoring method to enhance the quality of the generated audio. They demonstrate the efficiency of MAGNeT on the tasks of text-to-music and text-to-audio generation. The proposed approach is comparable in quality to the evaluated baselines while being significantly faster.
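The iterative inference described above is broadly in the MaskGIT family of decoders. As a rough illustration only, below is a minimal, hypothetical sketch of such a loop with a cosine masking schedule over a single token stream with greedy selection; the `model` interface, `mask_id`, and scheduling details are assumptions for illustration, not the paper's exact implementation, which operates over token spans and multiple codebook streams and adds rescoring.

```python
import math
import torch

def iterative_masked_decode(model, seq_len, mask_id, num_steps=20, device="cpu"):
    """Sketch of MaskGIT-style iterative decoding: start fully masked, then at
    each step commit the most confident predictions and re-mask the rest
    according to a cosine schedule. `model` is assumed to map a (1, seq_len)
    token tensor to per-position logits (hypothetical interface)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = model(tokens)                      # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        confidence, candidates = probs.max(dim=-1)  # greedy choice per position
        # Already-committed positions keep maximal confidence so they are never re-masked.
        confidence = torch.where(tokens == mask_id, confidence, torch.ones_like(confidence))
        # Cosine schedule: fraction of positions left masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_masked = int(mask_ratio * seq_len)
        # Fill every masked position with its greedy candidate.
        tokens = torch.where(tokens == mask_id, candidates, tokens)
        if num_masked > 0:
            # Re-mask the least confident positions for the next iteration.
            remask = confidence.topk(num_masked, largest=False).indices
            tokens[0, remask[0]] = mask_id
    return tokens
```

In this sketch the number of committed tokens grows each step, so the full sequence is produced in `num_steps` parallel passes rather than one pass per token as in autoregressive decoding, which is the source of the reported speedup.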

Publication date: 2024-01-09
Project Page: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT
Paper: https://arxiv.org/pdf/2401.04577