The paper addresses a challenge in text-to-audio (TTA) generation: user prompts are often under-specified compared with the text descriptions used to train TTA models. The authors treat TTA models as a ‘black box’ and hypothesize that there is a distribution of audio descriptions (‘audionese’) that these models handle better. They propose rewriting user prompts with instruction-tuned models and using text-audio alignment as a feedback signal to improve the generated audio. In both objective metrics and subjective human evaluations, the method significantly improved text-audio alignment and music audio quality.
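
To make the idea concrete, here is a minimal sketch of one way prompt rewriting with alignment feedback could be wired together: generate several rewrites, synthesize audio for each, and keep the prompt whose audio best matches the user's original request. This is a best-of-N selection under stated assumptions, not necessarily the authors' procedure. All names here (`select_best_rewrite`, `toy_rewrite`, `toy_generate`, `toy_score`, `TOY_VOCAB`) are hypothetical; the toy functions stand in for an instruction-tuned model, a black-box TTA model, and a text-audio alignment scorer.

```python
from typing import Callable, Sequence, Set, Tuple


def select_best_rewrite(
    user_prompt: str,
    rewrite: Callable[[str], Sequence[str]],
    generate: Callable[[str], object],
    score: Callable[[str, object], float],
) -> Tuple[str, float]:
    """Pick the candidate prompt whose generated audio scores highest.

    The TTA model is treated as a black box: only `generate` is called on it.
    Audio is always scored against the ORIGINAL user prompt, so a rewrite
    cannot win by drifting away from what the user asked for.
    """
    candidates = [user_prompt, *rewrite(user_prompt)]
    best_prompt, best_score = user_prompt, float("-inf")
    for candidate in candidates:
        audio = generate(candidate)
        candidate_score = score(user_prompt, audio)
        if candidate_score > best_score:  # ties keep the earlier candidate
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# ---- Toy stand-ins (assumptions for illustration only) ----

# Words the pretend TTA model "understands", a crude proxy for 'audionese'.
TOY_VOCAB: Set[str] = {"dog", "barking", "clear", "recording", "natural", "ambience"}


def _words(text: str) -> Set[str]:
    """Lowercase word set with trailing commas and periods stripped."""
    return {w.strip(",.") for w in text.lower().split()}


def toy_rewrite(prompt: str) -> Sequence[str]:
    """Stand-in for an instruction-tuned model proposing richer prompts."""
    return [
        f"{prompt}, high quality",
        f"a clear recording of {prompt} with natural ambience",
    ]


def toy_generate(prompt: str) -> Set[str]:
    """Stand-in for a TTA model: the 'audio' is the set of known prompt words."""
    return _words(prompt) & TOY_VOCAB


def toy_score(text: str, audio: Set[str]) -> float:
    """Stand-in for an alignment model.

    Returns 0.0 if any word of `text` is missing from the audio; otherwise
    returns the number of rendered words, so richer audio scores higher.
    """
    if not _words(text) <= audio:
        return 0.0
    return float(len(audio))


best_prompt, best_score = select_best_rewrite(
    "dog barking", toy_rewrite, toy_generate, toy_score
)
print(best_prompt, best_score)
```

With these toy stand-ins the most detailed rewrite is selected, because the pretend model renders more of its words. In a real setup the three callables would wrap actual models, and the selected prompts could also serve as feedback for improving the rewriter itself.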

 

Publication date: 1 Nov 2023
Project Page: https://arxiv.org/abs/2311.00897
Paper: https://arxiv.org/pdf/2311.00897