The paper examines the difficulty of extending large-scale pretraining and instruction tuning to vision-language models, where the diversity of visual inputs makes the standard pretraining stage data- and compute-intensive. Motivated by the need for data- and compute-efficient training strategies that avoid this expensive stage, the authors propose a more efficient pipeline for Query Transformer (QFormer)-based vision-language alignment, in which the QFormer conditions a language model for text generation. They show that this alternative speeds up vision-language representation learning and outperforms existing baselines in both pretraining efficiency and downstream performance.
Publication date: 14 Nov 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2311.07449
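
To make the QFormer-based conditioning concrete, the sketch below shows a minimal BLIP-2-style setup in PyTorch: learnable query tokens cross-attend to frozen image features, and the query outputs are projected into the LLM embedding space as a soft visual prefix for text generation. The class name, layer counts, and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of QFormer-style LLM conditioning (BLIP-2-like), not the paper's
# exact pipeline: learnable query tokens cross-attend to frozen vision features,
# and their outputs are projected into the LLM's embedding space as a soft prefix.
import torch
import torch.nn as nn

class QFormerConditioner(nn.Module):
    def __init__(self, num_queries=32, qformer_dim=768,
                 vision_dim=1024, llm_dim=4096, num_layers=2):
        super().__init__()
        # Learnable query tokens that "read" the frozen vision features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim) * 0.02)
        # Project frozen vision features into the QFormer width.
        self.vision_proj = nn.Linear(vision_dim, qformer_dim)
        # Small stack of cross-attention + feed-forward blocks (stand-in for the QFormer).
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(qformer_dim, num_heads=8, batch_first=True),
                "norm1": nn.LayerNorm(qformer_dim),
                "ffn": nn.Sequential(nn.Linear(qformer_dim, 4 * qformer_dim), nn.GELU(),
                                     nn.Linear(4 * qformer_dim, qformer_dim)),
                "norm2": nn.LayerNorm(qformer_dim),
            }) for _ in range(num_layers)
        ])
        # Map query outputs into the (frozen) LLM's embedding space.
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, vision_feats):
        # vision_feats: (batch, num_patches, vision_dim) from a frozen image encoder.
        kv = self.vision_proj(vision_feats)
        q = self.queries.expand(vision_feats.size(0), -1, -1)
        for blk in self.blocks:
            attn_out, _ = blk["cross_attn"](q, kv, kv)
            q = blk["norm1"](q + attn_out)
            q = blk["norm2"](q + blk["ffn"](q))
        # Soft visual prefix to prepend to the LLM's text token embeddings.
        return self.llm_proj(q)

if __name__ == "__main__":
    conditioner = QFormerConditioner()
    image_features = torch.randn(2, 257, 1024)  # e.g. ViT patch features from a frozen encoder
    visual_prefix = conditioner(image_features)
    print(visual_prefix.shape)                  # torch.Size([2, 32, 4096])
```

In such a setup, only the QFormer and projection layers are trained while the vision encoder and LLM stay frozen, which is the general property the paper's efficiency argument builds on; the specific training recipe proposed there differs and is described in the paper itself.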