The article presents ITIT, a novel training paradigm for vision-language generative models. Current models rely on large corpora of paired image-text data for optimal performance, but collecting such data at scale tends to yield pairs with weak image-text correlation, while human annotation is expensive and labor-intensive. ITIT, grounded in the concept of cycle consistency, enables training on unpaired image and text data. It uses a small set of paired image-text data to ensure reasonable generation in both directions, and is additionally trained on much larger image-only and text-only datasets by enforcing cycle consistency between each unpaired sample and its cycle-generated counterpart (image → generated caption → reconstructed image, and text → generated image → reconstructed text). The study shows that ITIT trained on unpaired datasets exhibits scaling behavior similar to that of training on high-quality paired data.
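
The cycle-consistency objective lends itself to a compact training step. The sketch below is a minimal illustration rather than the authors' implementation: the `model` interface (`t2i_loss`, `i2t_loss`, `generate_text`, `generate_image`) and the stop-gradient treatment of the generated intermediates are assumptions for clarity; the paper's actual handling of gradients through generation may differ.

```python
import torch

def itit_step(model, paired_batch, image_batch, text_batch, w_cycle=1.0):
    """One hypothetical ITIT-style training step: paired supervision
    plus cycle-consistency losses on unpaired images and texts."""
    # Supervised loss on the small paired set, in both directions.
    img, txt = paired_batch
    loss = model.t2i_loss(txt, img) + model.i2t_loss(img, txt)

    # Image cycle: image -> generated caption -> reconstructed image.
    # The generated caption is discrete, so we detach it (stop-gradient)
    # instead of backpropagating through sampling (an assumption here).
    with torch.no_grad():
        caption = model.generate_text(image_batch)
    loss = loss + w_cycle * model.t2i_loss(caption, image_batch)

    # Text cycle: text -> generated image -> reconstructed text.
    with torch.no_grad():
        synth_img = model.generate_image(text_batch)
    loss = loss + w_cycle * model.i2t_loss(synth_img, text_batch)

    return loss
```

Because the paired batch can be much smaller than the unpaired ones, a step like this lets the bulk of the gradient signal come from cheap image-only and text-only data while the paired set anchors both generation directions.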

Publication date: 5 Oct 2023
Project Page: ?
Paper: https://arxiv.org/pdf/2310.03734