The paper presents a new approach to visual instruction tuning by introducing LVIS-INSTRUCT4V, a fine-grained visual instruction dataset of 220K visually aligned, context-aware instructions. The instructions are produced by prompting GPT-4V with images from LVIS. The authors show that training on this data improves the performance of LLaVA-1.5, a large multimodal model, across a wide range of benchmarks. The dataset and model are available on GitHub.
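
The core idea is to feed each LVIS image directly to GPT-4V and ask it to write grounded instruction-answer pairs. Below is a minimal, illustrative sketch of one such generation step using the OpenAI Python SDK; the prompt text, model name, and image path are assumptions for illustration, not the authors' exact pipeline (the actual templates are in the paper and repository).

```python
# Minimal sketch (not the authors' exact pipeline): generating instruction-answer
# pairs by prompting GPT-4V with an LVIS image via the OpenAI API.
# The prompt wording, generation settings, and image path below are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read a local LVIS image and base64-encode it for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def generate_instructions(image_path: str) -> str:
    """Ask GPT-4V to write fine-grained, grounded Q&A pairs about the image."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V endpoint available at the time of the paper
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Look carefully at the image and write several "
                            "question-answer pairs that require fine-grained visual "
                            "understanding. Answers must be grounded in visible content."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content


print(generate_instructions("lvis/train2017/000000000139.jpg"))  # hypothetical path
```

The generated pairs can then be collected into a conversation-style instruction-tuning corpus and used to fine-tune a multimodal model such as LLaVA-1.5, which is the evaluation setting reported in the paper.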

Publication date: 13 Nov 2023
Project Page: https://github.com/X2FD/LVIS-INSTRUCT4V
Paper: https://arxiv.org/pdf/2311.07574