The paper addresses the difficulty of extending large-scale pretraining and instruction tuning to vision-language models, which stems from the diversity of visual inputs, and argues for new data- and compute-efficient training strategies that avoid the expensive pretraining stage. The authors propose a more efficient method for Query Transformer (QFormer)-based vision-language alignment: an alternative pipeline for conditioning a language model on QFormer outputs for text generation. They report that this pipeline speeds up vision-language representation learning and outperforms existing baselines in both pretraining efficiency and downstream performance.
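As background, the general QFormer-style conditioning pipeline (popularized by BLIP-2) uses a small set of learned queries that cross-attend to frozen image features and are then projected into the language model's embedding space as soft visual prompts. The sketch below illustrates that generic pattern only, not the paper's specific method; the class name `QFormerBridge`, the dimensions, and the layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Minimal QFormer-style bridge (illustrative sketch, not the paper's method):
    learned queries cross-attend to frozen image features, then are projected
    into the language model's embedding space as soft visual prompts."""

    def __init__(self, num_queries=32, vision_dim=1024,
                 hidden_dim=768, lm_dim=4096, num_layers=2, num_heads=12):
        super().__init__()
        # Learned query tokens shared across images (assumed count of 32).
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Map query outputs into the (frozen) language model's embedding space.
        self.lm_proj = nn.Linear(hidden_dim, lm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        memory = self.vision_proj(image_feats)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out = self.decoder(tgt=q, memory=memory)   # queries attend to visual tokens
        return self.lm_proj(out)                   # soft visual prompts for the LM


# Usage: the returned tensor would be prepended to the LM's text embeddings so the
# frozen language model is conditioned on the image when generating text.
bridge = QFormerBridge()
fake_patches = torch.randn(2, 257, 1024)           # e.g. ViT-L/14 patch features
visual_prompts = bridge(fake_patches)
print(visual_prompts.shape)                        # torch.Size([2, 32, 4096])
```

The appeal of this design is that only the lightweight bridge is trained while both the vision encoder and the language model stay frozen, which is what makes QFormer-based alignment a natural target for the efficiency improvements the paper pursues.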


Publication date: 14 Nov 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2311.07449