ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

ChatSpot is an end-to-end multimodal language model designed to improve interaction between humans and AI, with a particular focus on the usability of multimodal large language models (MLLMs). Existing MLLMs mainly accept language instructions, which limits the accuracy and efficiency of interaction. To overcome this, ChatSpot introduces precise referring instructions, which use reference representations such as points and boxes to specify a region of interest. This lets the MLLM concentrate on the specified region, leading to more fine-grained interaction. ChatSpot also supports several forms of input, including mouse clicks, drag-and-drop, and drawn boxes, offering a more flexible and seamless interactive experience.
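
To make the idea of a referring instruction concrete, here is a minimal Python sketch of how a clicked point or a drawn box might be attached to a text question. Everything in it is an assumption made for illustration: the Point and Box classes, the build_referring_instruction function, the <point> and <box> markers, and the choice of normalized coordinates are hypothetical and are not taken from the paper, which may encode regions differently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Point:
    """A clicked location in pixel coordinates (hypothetical type)."""
    x: float
    y: float


@dataclass(frozen=True)
class Box:
    """A drawn rectangle given by two opposite corners in pixels (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


Region = Union[Point, Box]


def _norm(value: float, size: int) -> float:
    """Scale a pixel coordinate to [0, 1], clamp it, and round to 3 decimals."""
    return round(min(max(value / size, 0.0), 1.0), 3)


def build_referring_instruction(
    question: str, region: Region, width: int, height: int
) -> str:
    """Append a textual region reference to a question.

    The <point>/<box> markers are an assumed format for this sketch,
    not the encoding used by ChatSpot.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")

    if isinstance(region, Point):
        ref = f"<point>({_norm(region.x, width)}, {_norm(region.y, height)})</point>"
    elif isinstance(region, Box):
        # Accept corners in any order, as a drag gesture may go in any direction.
        left, right = sorted((region.x1, region.x2))
        top, bottom = sorted((region.y1, region.y2))
        ref = (
            f"<box>({_norm(left, width)}, {_norm(top, height)}, "
            f"{_norm(right, width)}, {_norm(bottom, height)})</box>"
        )
    else:
        raise TypeError(f"unsupported region type: {type(region).__name__}")

    return f"{question} {ref}"


if __name__ == "__main__":
    print(build_referring_instruction("What is this?", Point(320, 240), 640, 480))
    print(build_referring_instruction("Describe this region.", Box(480, 360, 160, 120), 640, 480))
```

Normalizing coordinates by the image size is one common design choice because it keeps the reference independent of image resolution; a real system could just as well use pixel values or dedicated location tokens.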

 

Publication date: 18 Jul 2023
Project Page: https://chatspot.streamlit.app
Paper: https://arxiv.org/pdf/2307.09474.pdf