The article investigates the generalization of Multimodal Large Language Models (MLLMs) under out-of-distribution shifts and in domain-specific tasks. Evaluations on synthetic images, real-world distribution shifts, and specialized datasets such as medical and molecular imagery show that MLLMs generalize poorly beyond common training domains, and the research identifies mapping deficiency as the main obstacle to reliable performance. In-context learning (ICL) substantially improves MLLMs' generalization, but it remains vulnerable to domain shifts, label shifts, and spurious-correlation shifts between the in-context examples and the test data.

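To make the ICL setup concrete, the following is a minimal sketch of how in-context demonstrations might be assembled into an MLLM prompt: a few labeled (image, label) pairs precede the query image. The `build_icl_prompt` helper and the `<image:...>` placeholder convention are hypothetical illustrations, not any specific MLLM API; the article's finding is that the effectiveness of such prompts depends on the demonstrations matching the test data in domain, label distribution, and spurious features.

```python
def build_icl_prompt(examples, query_image, instruction="Classify the image."):
    """Interleave labeled demonstrations before the query image.

    examples: list of (image_reference, label) pairs used as in-context
    demonstrations; query_image: reference to the image to be classified.
    The <image:...> tokens are placeholders standing in for however a
    given MLLM ingests images (hypothetical convention).
    """
    parts = [instruction]
    for image_ref, label in examples:
        parts.append(f"<image:{image_ref}> Label: {label}")
    # The model is expected to complete the final "Label:" for the query.
    parts.append(f"<image:{query_image}> Label:")
    return "\n".join(parts)


prompt = build_icl_prompt(
    [("scan_01.png", "abnormal"), ("scan_02.png", "normal")],
    "scan_query.png",
)
print(prompt)
```

Under the article's findings, drawing the demonstration pairs from the same domain and label distribution as the query (here, the same kind of medical scan) is what makes such a prompt help rather than hurt.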
Publication date: 2024-02-09
arXiv page: https://arxiv.org/abs/2402.06599v1
Paper: https://arxiv.org/pdf/2402.06599