This article introduces LLaVA-Phi, a compact multi-modal assistant built on the small language model Phi-2 that handles intricate dialogues combining textual and visual inputs. Despite having only 3 billion parameters, LLaVA-Phi delivers commendable performance on public benchmarks spanning visual comprehension, reasoning, and knowledge-based perception, and it is particularly strong on ScienceQA, where it outperforms larger multi-modal models. The model opens up new possibilities for applications that require real-time interaction in time-sensitive settings, such as embodied agents, and demonstrates that smaller language models can reach sophisticated levels of understanding and interaction while remaining far more resource-efficient.
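The design follows the LLaVA recipe: a pre-trained vision encoder produces image features, a small projector maps them into the language model's embedding space, and the projected image tokens are prepended to the text prompt before generation with Phi-2. The sketch below is a minimal illustration of that flow; the CLIP checkpoint name, the two-layer MLP projector, and the untrained projector weights are assumptions made here for illustration (the released checkpoints and inference code are on the project page), so this snippet shows the wiring rather than a ready-to-use LLaVA-Phi.

```python
# Illustrative sketch of a LLaVA-style pipeline around Phi-2 (assumptions noted above).
import torch
from torch import nn
from PIL import Image
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    CLIPImageProcessor, CLIPVisionModel,
)

# Vision tower: encodes the image into a sequence of patch features.
vision_name = "openai/clip-vit-large-patch14-336"  # assumed encoder for illustration
vision_tower = CLIPVisionModel.from_pretrained(vision_name)
image_processor = CLIPImageProcessor.from_pretrained(vision_name)

# Language model: the 2.7B-parameter Phi-2 backbone.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Projector: maps vision features into the LM embedding space.
# Randomly initialized here; a real checkpoint would provide trained weights.
hidden_vision = vision_tower.config.hidden_size   # 1024 for ViT-L/14
hidden_llm = llm.config.hidden_size               # 2560 for Phi-2
projector = nn.Sequential(
    nn.Linear(hidden_vision, hidden_llm),
    nn.GELU(),
    nn.Linear(hidden_llm, hidden_llm),
)

@torch.no_grad()
def answer(image: Image.Image, question: str, max_new_tokens: int = 64) -> str:
    # 1) Encode the image; drop the [CLS] token, keep the patch features.
    pixels = image_processor(images=image, return_tensors="pt").pixel_values
    patch_feats = vision_tower(pixels).last_hidden_state[:, 1:, :]

    # 2) Project patch features into the LM embedding space.
    image_embeds = projector(patch_feats)

    # 3) Embed the text prompt and prepend the image tokens.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(prompt_ids)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    # 4) Generate the answer conditioned on both modalities.
    out = llm.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Example usage (illustrative only; the projector above is untrained):
# print(answer(Image.open("example.jpg"), "What is shown in this image?"))
```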

Publication date: 5 Jan 2024
Project Page: https://github.com/zhuyiche/llava-phi
Paper: https://arxiv.org/pdf/2401.02330