The article introduces SPHINX-X, a series of Multi-modality Large Language Models (MLLMs) built on top of SPHINX. To improve architecture and training efficiency, SPHINX-X removes redundant visual encoders, bypasses fully-padded sub-images with skip tokens, and simplifies the multi-stage training pipeline into a single-stage all-in-one paradigm. The paper also assembles a comprehensive multi-domain, multi-modal dataset that covers publicly available resources for language, vision, and vision-language tasks. Across the resulting model family, the SPHINX-X models demonstrate a strong correlation between multi-modal performance and the scale of both data and parameters.
Publication date: 8 Feb 2024
Project Page: https://github.com/Alpha-VLLM/LLaMA2-Accessory
Paper: https://arxiv.org/pdf/2402.05935
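
As a rough illustration of the skip-token mechanism mentioned above, the sketch below shows how a fully-padded sub-image could be replaced by a single learnable token instead of a full grid of visual tokens. This is a minimal assumption-laden sketch, not the released SPHINX-X code: the `SubImageTokenizer` class, the all-zeros padding test, and the toy encoder are hypothetical stand-ins.

```python
# Sketch (not the authors' implementation): sub-images that are entirely
# padding are represented by one learnable skip token rather than being
# passed through the visual encoder, shortening the visual token sequence.
import torch
import torch.nn as nn


class SubImageTokenizer(nn.Module):
    """Hypothetical module: encodes real sub-images, emits a single skip
    token for fully-padded ones."""

    def __init__(self, encoder: nn.Module, embed_dim: int):
        super().__init__()
        self.encoder = encoder  # any encoder mapping (1, C, H, W) -> (1, T, D)
        self.skip_token = nn.Parameter(torch.zeros(1, embed_dim))  # learnable skip token

    def forward(self, sub_images: torch.Tensor) -> list[torch.Tensor]:
        # sub_images: (N, C, H, W); padded regions are assumed to be exactly zero
        tokens = []
        for img in sub_images:
            if torch.all(img == 0):
                tokens.append(self.skip_token)  # 1 token instead of T
            else:
                tokens.append(self.encoder(img.unsqueeze(0)).squeeze(0))  # (T, D)
        return tokens  # concatenated with text tokens downstream


if __name__ == "__main__":
    embed_dim, num_tokens = 8, 4
    # Toy stand-in for a visual encoder: (1, 3, 28, 28) -> (1, num_tokens, embed_dim)
    toy_encoder = nn.Sequential(
        nn.Flatten(1),
        nn.Linear(3 * 28 * 28, num_tokens * embed_dim),
        nn.Unflatten(1, (num_tokens, embed_dim)),
    )
    tokenizer = SubImageTokenizer(toy_encoder, embed_dim)
    subs = torch.stack([torch.randn(3, 28, 28), torch.zeros(3, 28, 28)])  # one real, one fully padded
    print([t.shape for t in tokenizer(subs)])  # [torch.Size([4, 8]), torch.Size([1, 8])]
```

In this toy setup, the padded sub-image contributes a single token rather than `num_tokens` tokens, which is the efficiency gain the skip-token design aims for on images whose aspect ratio leaves some sub-images empty.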