The paper presents a new approach to visual instruction tuning by introducing a fine-grained visual instruction dataset, LVIS-Instruct4V, containing 220K visually aligned and context-aware instructions. The instructions are produced by prompting GPT-4V with images from LVIS. Experiments show that this data improves the performance of LLaVA-1.5, a large multimodal model, across a wide range of benchmarks. The dataset and model are released on GitHub.
Publication date: 13 Nov 2023
Project Page: https://github.com/X2FD/LVIS-INSTRUCT4V
Paper: https://arxiv.org/pdf/2311.07574
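To make the data-generation idea concrete, here is a minimal sketch of how one might prompt GPT-4V with an image to elicit instruction-following conversations. This is not the authors' pipeline: the prompt wording, the `generate_instructions` helper, the file paths, and the model name are illustrative assumptions layered on the OpenAI chat completions vision API as it existed around the paper's release.

```python
# Minimal sketch (not the authors' pipeline): query GPT-4V with an LVIS-style
# image to generate an instruction/answer conversation. Prompt wording, model
# name, and paths below are illustrative assumptions.
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_instructions(image_path: str) -> str:
    """Ask GPT-4V to produce a short, visually grounded conversation for one image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V endpoint available at the time
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Look carefully at the image and write a short "
                            "conversation: a user question grounded in specific "
                            "visual details, followed by an accurate answer."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content


# Example usage over a few images (paths are placeholders):
# for path in ["lvis/000001.jpg", "lvis/000002.jpg"]:
#     print(generate_instructions(path))
```

Scaled over the LVIS image set and with prompts tuned for fine-grained, context-aware questions, this kind of loop is the general shape of how such an instruction dataset could be collected.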