The paper introduces ViP-LLaVA, a multimodal model that understands arbitrary visual prompts, letting users mark up images with natural cues such as a red bounding box or a pointed arrow. Whereas most current models focus on whole-image understanding, this model can reason about user-specified regions, making interaction more intuitive. It achieves state-of-the-art performance on region-level tasks such as Visual7W, PointQA, and the Visual Commonsense Reasoning benchmark. The researchers also present ViP-Bench, a benchmark for assessing how well models understand visual prompts.
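The key idea is that the region of interest is communicated in pixel space, by drawing directly on the image, rather than as textual coordinates. The sketch below illustrates this with a simple Pillow helper that overlays a red bounding-box prompt on an image; the helper name, file paths, and coordinates are hypothetical, and the actual model inference interface should be taken from the project's released code rather than from this example.

```python
from PIL import Image, ImageDraw


def add_box_prompt(image_path, box, color="red", width=4):
    """Overlay a bounding-box visual prompt directly in pixel space.

    ViP-LLaVA-style models consume the marked-up image itself, so the
    region of interest is conveyed by drawing on the image instead of
    passing coordinates as text.
    """
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, outline=color, width=width)  # box = (x0, y0, x1, y1)
    return image


# Hypothetical usage: mark a region, then pair the marked image with a
# question such as "What is the object inside the red bounding box?"
# when calling the model via the project's inference code.
marked = add_box_prompt("street_scene.jpg", box=(120, 80, 310, 260))
marked.save("street_scene_prompted.jpg")
```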

Publication date: 4 Dec 2023
Project Page: https://vip-llava.github.io
Paper: https://arxiv.org/pdf/2312.00784