The article investigates the basic language abilities of pre-trained multimodal models, asking whether they genuinely understand how image and text interact. It introduces the BLA Benchmark, a tool designed to evaluate these models on basic linguistic constructions: active-passive voice, coordination, and relative clauses. The study reveals that popular systems such as CLIP, ViLBERT, and BLIP2 struggle with these constructions in a zero-shot setting, although the generative model BLIP2 shows promise, particularly in an in-context learning setting. The authors propose that the BLA Benchmark could also be used to enhance these models' language abilities.
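To make the zero-shot evaluation concrete, below is a minimal sketch of how a contrastive model like CLIP can be probed with an image and two competing captions (e.g., a correct active-voice sentence versus a role-swapped passive one). The image path and captions are hypothetical stand-ins, not taken from the BLA Benchmark itself, and the model checkpoint shown is just the standard public CLIP release.

```python
# Minimal sketch: zero-shot image-text matching with CLIP for a BLA-style probe.
# The image file and captions below are hypothetical examples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical image
captions = [
    "The dog is chasing the cat.",          # assumed correct caption (active voice)
    "The dog is being chased by the cat.",  # role-swapped distractor (passive voice)
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-caption similarity scores;
# the model "passes" the item if the correct caption receives the higher score.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```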

Publication date: 23 Oct 2023
arXiv Page: https://arxiv.org/abs/2310.15061v1
Paper: https://arxiv.org/pdf/2310.15061