The article discusses Auffusion, a Text-to-Audio (TTA) system that builds on diffusion models and large language models. Auffusion adapts Text-to-Image (T2I) diffusion models to the TTA task, improving generation quality and text-audio alignment. Despite using limited data and computational resources, the system outperforms previous TTA approaches. The article also highlights how strongly the choice of encoder affects cross-modal alignment, a factor often overlooked in TTA studies. Auffusion generates audio that accurately matches its textual description, which it also demonstrates in related tasks such as audio style transfer and inpainting.
Publication date: 4 Jan 2024
Project Page: https://auffusion.github.io
Paper: https://arxiv.org/pdf/2401.01044