Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

The article discusses Auffusion, a Text-to-Audio (TTA) system that builds on diffusion models and large language models. Auffusion adapts Text-to-Image (T2I) diffusion models to the TTA task, which improves generation quality and text-audio alignment. The system is reported to outperform previous TTA approaches while using limited data and computational resources. The article also highlights how the choice of encoder affects cross-modal alignment, a factor often overlooked in TTA studies. Auffusion generates audio that closely matches its textual description, and it is also shown to be effective in tasks such as audio style transfer and inpainting.

Publication date: 4 Jan 2024
Project Page: https://auffusion.github.io
Paper: https://arxiv.org/pdf/2401.01044
