The paper introduces LVIS-Instruct4V, a fine-grained visual instruction dataset containing 220K visually aligned, context-aware instructions for visual instruction tuning. The instructions are produced by prompting GPT-4V with images from LVIS. The authors show that training LLaVA-1.5, a large multimodal model, on this data improves its performance across a range of benchmarks. The dataset and model are available on GitHub.


Publication date: 13 Nov 2023
Project Page: