The study focuses on improving large multimodal models (LMMs), specifically the LLaVA model, through visual instruction tuning. By switching to CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data, the researchers establish stronger baselines. The improved model uses only about 1.2M publicly available training samples and completes full training in roughly a day on a single node with 8 A100 GPUs. The aim is to make state-of-the-art LMM research more accessible.
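To make the architectural change concrete, here is a minimal sketch (not the authors' code) of the MLP vision-language connector that replaces LLaVA's original single linear projection; the class name and the hidden sizes (1024 for CLIP-ViT-L features, 4096 for a 7B-scale LLM) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP mapping CLIP-ViT patch features into the LLM embedding space."""

    def __init__(self, vision_hidden_size: int = 1024, llm_hidden_size: int = 4096):
        super().__init__()
        # Sketch of the change: a single linear layer is replaced by an MLP with a GELU in between.
        self.projector = nn.Sequential(
            nn.Linear(vision_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_hidden_size)
        return self.projector(visual_features)


# Example: a 336px image through CLIP-ViT-L/14 yields 24x24 = 576 patch tokens of dim 1024
connector = VisionLanguageConnector()
dummy_features = torch.randn(1, 576, 1024)
print(connector(dummy_features).shape)  # torch.Size([1, 576, 4096])
```

The projected tokens are then concatenated with the text embeddings and fed to the language model as usual.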


Publication date: 5 Oct 2023
Project Page: https://llava-vl.github.io
Paper: https://arxiv.org/pdf/2310.03744