The study investigates the advantages and bottlenecks of Large Language Models (LLMs) in addressing Text-rich Visual Question Answering (VQA) tasks, which require both image comprehension and text recognition. The approach decouples the vision and language modules, leveraging external OCR models for text recognition and LLMs for answering questions over the recognized text. The study finds that LLMs have strong comprehension ability and can introduce helpful external knowledge for VQA tasks, while the main bottleneck in addressing such tasks lies in the visual part. It also finds that combining OCR models with Multimodal Large Language Models (MLLMs) is effective, offering insights on training an MLLM that preserves the abilities of the underlying LLM.
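To make the decoupled design concrete, below is a minimal sketch (not the paper's code) of such a pipeline: an external OCR model reads the text in the image, and an LLM answers the question from the OCR transcript alone. The helpers `run_ocr` and `query_llm` are hypothetical stand-ins for whatever OCR engine and LLM backend are actually used.

```python
from typing import List


def run_ocr(image_path: str) -> List[str]:
    """Hypothetical OCR wrapper: returns the text tokens detected in the image."""
    # Replace with any OCR engine (e.g. Tesseract, PaddleOCR).
    raise NotImplementedError("plug in an OCR engine here")


def query_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper: returns the model's answer to the prompt."""
    # Replace with a call to a local or hosted LLM.
    raise NotImplementedError("plug in an LLM backend here")


def answer_text_rich_vqa(image_path: str, question: str) -> str:
    # Vision side: recognize the text in the image with an external OCR model.
    ocr_tokens = run_ocr(image_path)

    # Language side: hand the OCR transcript to the LLM together with the question.
    prompt = (
        "You are answering a question about a text-rich image.\n"
        f"OCR tokens extracted from the image: {', '.join(ocr_tokens)}\n"
        f"Question: {question}\n"
        "Answer concisely."
    )
    return query_llm(prompt)
```

In this setup the LLM never sees pixels, which is what lets the study separate the language model's comprehension ability from the visual (text-recognition) bottleneck.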


Publication date: 13 Nov 2023
Project Page: https://arxiv.org/abs/2311.07306
Paper: https://arxiv.org/pdf/2311.07306