CLIPSONIC: TEXT-TO-AUDIO SYNTHESIS WITH UNLABELED VIDEOS AND PRETRAINED LANGUAGE-VISION MODELS
CLIPSONIC is a novel approach to text-to-audio synthesis that leverages unlabeled videos and pretrained language-vision models. The study aims to address the challenge of acquiring high-quality text annotations for audio…