The article introduces SPHINX-X, a Multi-modality Large Language Model (MLLM) series that builds on the SPHINX framework. The researchers improve the architecture and training efficiency by removing redundant visual encoders, simplifying the multi-stage training recipe into a single-stage all-in-one paradigm, and bypassing fully-padded sub-images with skip tokens (a minimal sketch of this idea follows below). They also compile a comprehensive multi-domain, multi-modal dataset covering language, vision, and vision-language tasks. The article reveals a strong correlation between multi-modal performance and both data and parameter scale.
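The skip-token trick can be illustrated with a minimal PyTorch-style sketch. The function name `build_visual_tokens`, the all-padding check, and the learnable `skip_token` embedding are illustrative assumptions, not the authors' actual implementation; the idea is simply that a sub-image consisting entirely of padding contributes one token instead of a full set of encoder outputs, shortening the visual sequence.

```python
import torch
import torch.nn as nn

def build_visual_tokens(sub_images, encoder, skip_token, pad_value=0.0):
    """Encode sub-images of a high-resolution input, emitting a single
    skip token for any sub-image that consists entirely of padding.

    sub_images: (N, C, H, W) crops produced by splitting/padding the image
    encoder:    callable mapping (1, C, H, W) -> (T, D) visual tokens
    skip_token: (1, D) embedding standing in for a fully-padded crop
    """
    token_chunks = []
    for sub in sub_images:
        if torch.all(sub == pad_value):
            # Fully-padded sub-image: skip the encoder, append one token.
            token_chunks.append(skip_token)
        else:
            token_chunks.append(encoder(sub.unsqueeze(0)))
    # Concatenate into one (shortened) visual token sequence.
    return torch.cat(token_chunks, dim=0)

# Toy usage: a stand-in encoder that returns 256 tokens of width 4096.
encoder = lambda x: torch.randn(256, 4096)
skip_token = nn.Parameter(torch.zeros(1, 4096))
subs = torch.zeros(4, 3, 224, 224)      # three crops are pure padding
subs[0] = torch.rand(3, 224, 224)       # one crop carries real content
tokens = build_visual_tokens(subs, encoder, skip_token)
print(tokens.shape)  # (256 + 3, 4096) instead of (4 * 256, 4096)
```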


Publication date: 8 Feb 2024
Project Page: https://github.com/Alpha-VLLM/LLaMA2-Accessory
Paper: https://arxiv.org/pdf/2402.05935