The paper introduces VisLingInstruct, a method for improving Multi-Modal Language Models (MMLMs) on zero-shot tasks. Because current MMLMs' performance depends heavily on the quality of the instructional text they receive, VisLingInstruct autonomously evaluates and optimizes instructions through In-Context Learning, improving the synergy between visual perception and linguistic expression. The authors also optimize the visual feature extraction modules in MMLMs to make them more responsive to textual cues. Experiments on MMLMs based on FlanT5 and Vicuna show that VisLingInstruct significantly boosts zero-shot performance on visual multi-modal tasks.
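
The general idea of evaluating and rewriting instructions with an LLM can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the `generate` callable, the prompts, and the compare-then-rewrite flow are all assumptions standing in for whatever instruction-following model (e.g. a Vicuna- or FlanT5-based MMLM) is actually used.

```python
# Hypothetical sketch of autonomous instruction optimization via in-context
# comparison. `generate` is a placeholder for any LLM text-generation call.

from typing import Callable, List


def optimize_instruction(
    generate: Callable[[str], str],   # assumed LLM call: prompt -> completion
    candidates: List[str],            # alternative phrasings of the same instruction
) -> str:
    """Pick the clearest candidate instruction, then ask the LLM to refine it."""
    # 1) Comparative evaluation: show all candidates in-context and ask the
    #    model which one is the clearest, most answerable instruction.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    eval_prompt = (
        "Below are several instructions that ask the same thing about an image.\n"
        f"{numbered}\n"
        "Reply with only the number of the clearest, most specific instruction."
    )
    choice = generate(eval_prompt).strip()
    if choice.isdigit() and 1 <= int(choice) <= len(candidates):
        best = candidates[int(choice) - 1]
    else:
        best = candidates[0]  # fall back if the reply is not a valid index

    # 2) Optimization: ask the model to rewrite the chosen instruction so a
    #    vision-language model can follow it more easily.
    rewrite_prompt = (
        "Rewrite the following instruction so it is concise and unambiguous "
        f"for a vision-language model:\n{best}"
    )
    return generate(rewrite_prompt).strip()
```

In use, `generate` would be backed by the same language model that answers the multi-modal query, so the instruction is tuned to what that model finds easiest to follow.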


Publication date: 13 Feb 2024
Project Page: https://github.com/Zhudongsheng75/VisLingInstruct
Paper: https://arxiv.org/pdf/2402.07398