The article presents SPHINX-X, an enhancement of the SPHINX framework for multi-modal large language models (MLLMs). SPHINX-X improves the architecture and training efficiency by removing redundant visual encoders, simplifying the multi-stage training pipeline, and bypassing fully-padded sub-images. It is trained on a comprehensive multi-domain, multi-modal dataset assembled from publicly available resources spanning language, vision, and vision-language tasks. By training over a variety of base LLMs, the authors obtain a family of MLLMs with different parameter sizes and multilingual capabilities, and they report a strong correlation between multi-modal performance and the scale of both data and parameters.
Publication date: 8 Feb 2024
Project Page: https://github.com/Alpha-VLLM/LLaMA2-Accessory
Paper: https://arxiv.org/pdf/2402.05935
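
To make the "bypassing fully-padded sub-images" idea more concrete, here is a minimal sketch of how one might detect and drop sub-images that contain only padding before passing them to a visual encoder. This is an illustration under assumptions, not the paper's implementation: the function names (`split_into_subimages`, `drop_fully_padded`), the tile size, and the padding value of 0.0 are all hypothetical choices.

```python
import torch

def split_into_subimages(image: torch.Tensor, sub_size: int) -> list[torch.Tensor]:
    """Split a (C, H, W) image into non-overlapping sub_size x sub_size tiles."""
    _, h, w = image.shape
    subs = []
    for top in range(0, h, sub_size):
        for left in range(0, w, sub_size):
            subs.append(image[:, top:top + sub_size, left:left + sub_size])
    return subs

def drop_fully_padded(subimages: list[torch.Tensor], pad_value: float = 0.0) -> list[torch.Tensor]:
    """Keep only sub-images that contain at least one non-padding pixel."""
    return [s for s in subimages if not torch.all(s == pad_value)]

# Example: a 448x448 canvas holding a 448x224 image, padded with zeros on the right half.
canvas = torch.zeros(3, 448, 448)              # pad_value = 0.0 (assumed)
canvas[:, :, :224] = torch.rand(3, 448, 224)   # real content on the left half

subs = split_into_subimages(canvas, sub_size=224)  # 2x2 = 4 tiles
kept = drop_fully_padded(subs)                     # the two right-hand tiles are all padding
print(len(subs), "->", len(kept))                  # 4 -> 2
```

The point of skipping such tiles is that they carry no visual information, so excluding them from the encoder reduces the number of visual tokens and the associated compute without affecting the content the model actually sees.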