The study focuses on improving large multimodal models (LMMs), specifically the LLaVA model, through visual instruction tuning. By switching to CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data, the researchers establish stronger baselines. The improved model uses only about 1.2M publicly available training samples and completes full training in roughly a day on a single node with 8 A100 GPUs. The aim is to make state-of-the-art LMM research more accessible.
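To make the architectural change concrete, here is a minimal sketch (not the authors' code) of the MLP vision-language connector that replaces LLaVA's original single linear projection; the class name and the hidden sizes (1024 for CLIP-ViT-L features, 4096 for a 7B-scale LLM) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP mapping CLIP-ViT patch features into the LLM embedding space."""

    def __init__(self, vision_hidden_size: int = 1024, llm_hidden_size: int = 4096):
        super().__init__()
        # Sketch of the change: a single linear layer is replaced by an MLP with a GELU in between.
        self.projector = nn.Sequential(
            nn.Linear(vision_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_hidden_size)
        return self.projector(visual_features)


# Example: a 336px image through CLIP-ViT-L/14 yields 24x24 = 576 patch tokens of dim 1024
connector = VisionLanguageConnector()
dummy_features = torch.randn(1, 576, 1024)
print(connector(dummy_features).shape)  # torch.Size([1, 576, 4096])
```

The projected tokens are then concatenated with the text embeddings and fed to the language model as usual.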


Publication date: 5 Oct 2023
Project Page: https://llava-vl.github.io
Paper: https://arxiv.org/pdf/2310.03744