The paper addresses distributional shift in machine learning models, specifically in text-to-audio generation. The authors observe a consistent degradation in audio quality when generation is conditioned on user prompts rather than on prompts drawn from the training set. They present a retrieval-based in-context prompt editing framework that uses training captions as demonstrative exemplars to rewrite user prompts so they better match the training distribution. The framework improved audio quality across a collected set of user prompts.
Publication date: 3 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.00895
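
A minimal sketch of how such a retrieval-based prompt-editing step might look, assuming an off-the-shelf sentence encoder and an instruction-following LLM for the rewrite; the encoder choice, the example captions, and the prompt template are illustrative assumptions, not the paper's exact configuration (the LLM call itself is omitted):

```python
# Hypothetical sketch: retrieve training captions similar to a user prompt and
# assemble them as in-context exemplars for an LLM to rewrite the prompt before
# text-to-audio generation. Details are assumptions, not the paper's method.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder

training_captions = [  # placeholder training-set captions
    "A dog barks while birds chirp in the background",
    "Heavy rain falling on a metal roof",
    "A car engine starts and then idles",
]
caption_embeddings = encoder.encode(training_captions, normalize_embeddings=True)

def retrieve_exemplars(user_prompt: str, k: int = 2) -> list[str]:
    """Return the k training captions most similar to the user prompt."""
    query = encoder.encode([user_prompt], normalize_embeddings=True)
    scores = caption_embeddings @ query.T  # cosine similarity (normalized vectors)
    top_k = np.argsort(-scores.ravel())[:k]
    return [training_captions[i] for i in top_k]

def build_editing_prompt(user_prompt: str) -> str:
    """Assemble an in-context instruction asking an LLM to rewrite the user
    prompt in the style of the retrieved training captions."""
    exemplars = retrieve_exemplars(user_prompt)
    examples = "\n".join(f"- {c}" for c in exemplars)
    return (
        "Rewrite the following prompt so that it matches the style of these "
        f"audio captions:\n{examples}\n\nPrompt: {user_prompt}\nRewritten:"
    )

print(build_editing_prompt("doggo barking lol"))
```

The rewritten prompt returned by the LLM would then replace the original user prompt as the conditioning text for the audio generation model.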