The paper presents SPHINX, a versatile multi-modal large language model (MLLM) built on a joint mixing strategy over model weights, tuning tasks, and visual embeddings. To enable multi-purpose capabilities, the model is tuned on a mixture of tasks, including region-level understanding, caption grounding, document layout detection, and human pose estimation. SPHINX also mixes comprehensive visual embeddings extracted from different network architectures, pre-training paradigms, and levels of information granularity, yielding more robust image representations. Finally, it integrates an efficient strategy for high-resolution inputs, encoding sub-images alongside a downsampled global view to better capture fine-grained appearances.
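The embedding mixing and the high-resolution strategy are the most implementation-flavored parts of the design: token sequences from several pre-trained vision backbones are concatenated per token, and high-resolution images are processed as several sub-images plus a downsampled global view. Below is a minimal sketch of both ideas in PyTorch; the toy encoders, function names (`mix_visual_embeddings`, `encode_high_res`), and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of two ideas from SPHINX:
# (1) mixing visual embeddings from several encoders by channel-wise
#     concatenation, and
# (2) the high-resolution strategy of encoding four sub-images plus a
#     downsampled global view. All names and sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a pre-trained vision backbone (e.g. a ViT or ConvNet)."""

    def __init__(self, dim: int, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, N, dim), one token per patch
        return self.proj(x).flatten(2).transpose(1, 2)


def mix_visual_embeddings(encoders, x):
    """Concatenate token sequences from multiple encoders channel-wise."""
    return torch.cat([enc(x) for enc in encoders], dim=-1)  # (B, N, sum(dims))


def encode_high_res(encoders, image, base: int = 224):
    """Encode a 2x-resolution image as four sub-images plus a downsampled
    global view, then concatenate all views along the token axis."""
    views = [F.interpolate(image, size=(base, base), mode="bilinear",
                           align_corners=False)]               # global view
    for i in (0, base):
        for j in (0, base):
            views.append(image[:, :, i:i + base, j:j + base])  # 4 sub-images
    tokens = [mix_visual_embeddings(encoders, v) for v in views]
    return torch.cat(tokens, dim=1)  # (B, 5 * N, sum(dims))


if __name__ == "__main__":
    encoders = [ToyEncoder(512), ToyEncoder(768)]  # two hypothetical backbones
    hi_res = torch.randn(1, 3, 448, 448)           # 2x the 224 base resolution
    out = encode_high_res(encoders, hi_res)
    print(out.shape)  # torch.Size([1, 980, 1280]): 5 views * 196 tokens each
```

Channel-wise concatenation widens each visual token without increasing the token count, while concatenating the five views along the token axis lets the language model attend to both global context and local detail.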

Publication date: 13 Nov 2023
Project Page: https://github.com/Alpha-VLLM/LLaMA2-Accessory
Paper: https://arxiv.org/pdf/2311.07575