The article introduces SPHINX-X, a Multi-modality Large Language Model (MLLM) series that builds on the SPHINX framework. The researchers improve the architecture and training efficiency by removing redundant visual encoders, simplifying multi-stage training into a one-stage all-in-one paradigm, and bypassing fully-padded sub-images with skip tokens. They also compile a comprehensive multi-domain and multi-modal dataset covering language, vision, and vision-language tasks. The article reports a strong correlation between multi-modal performance and both data and parameter scales.
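To make the skip-token idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' implementation): sub-images that consist entirely of padding contribute a single learnable token instead of their full patch-token grid, shortening the visual sequence passed to the LLM. All class and variable names below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SkipTokenPacker(nn.Module):
    """Packs per-sub-image visual tokens, collapsing fully-padded sub-images.

    Assumed shapes (illustrative only):
      sub_images: (num_sub_images, C, H, W) crops of a padded canvas
      sub_tokens: (num_sub_images, tokens_per_sub_image, dim) encoder outputs
    """

    def __init__(self, dim: int):
        super().__init__()
        # One learnable embedding stands in for an all-padding sub-image.
        self.skip_token = nn.Parameter(torch.zeros(1, dim))

    def forward(self, sub_images: torch.Tensor, sub_tokens: torch.Tensor) -> torch.Tensor:
        packed = []
        for img, toks in zip(sub_images, sub_tokens):
            if torch.all(img == 0):             # sub-image is pure padding
                packed.append(self.skip_token)  # emit 1 skip token instead of a full grid
            else:
                packed.append(toks)             # keep the full token grid
        return torch.cat(packed, dim=0)         # shortened visual sequence


# Toy usage: 4 sub-images, the last two are fully padded.
dim, tokens_per = 64, 16
imgs = torch.randn(4, 3, 32, 32)
imgs[2:] = 0.0
toks = torch.randn(4, tokens_per, dim)
packer = SkipTokenPacker(dim)
print(packer(imgs, toks).shape)  # torch.Size([34, 64]) = 2*16 + 2*1
```

The payoff is purely a sequence-length saving: padded regions of the canvas no longer cost a full block of visual tokens, which shortens training and inference without discarding any real image content.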
Publication date: 8 Feb 2024
Project Page: https://github.com/Alpha-VLLM/LLaMA2-Accessory
Paper: https://arxiv.org/pdf/2402.05935