Improved Baselines with Visual Instruction Tuning
The study improves large multimodal models (LMMs), specifically LLaVA, through visual instruction tuning. By using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data, the…
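To make the "vision encoder plus MLP projection" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the patch size of 14, the two-layer shape, the GELU activation, and the toy widths (8 and 16) are all assumptions that the summary above does not state, and every function name is hypothetical.

```python
import math
import random


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def gelu(x: float) -> float:
    """Exact GELU, computed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(rng: random.Random, in_dim: int, out_dim: int):
    """Return (weight, bias) for a linear layer; weight is a list of out_dim rows."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_project(tokens, layer1, layer2):
    """Map each vision token to the language model's embedding width."""
    projected = []
    for tok in tokens:
        hidden = [gelu(h) for h in linear(tok, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


# Toy widths (assumed for illustration; real models use far larger ones).
VISION_DIM, LLM_DIM = 8, 16
rng = random.Random(0)
layer1 = init_linear(rng, VISION_DIM, LLM_DIM)
layer2 = init_linear(rng, LLM_DIM, LLM_DIM)

# Assumed patch size of 14 for the 336px encoder.
n_tokens = num_patch_tokens(336, 14)

# Random stand-ins for the encoder's patch features.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
          for _ in range(n_tokens)]
projected = mlp_project(tokens, layer1, layer2)
```

Under these assumptions, a 336px image yields one token per 14x14 patch, and the projector changes only the width of each token, not the number of tokens, so the language model receives the same count of visual tokens that the encoder produced.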