LEGO is a multi-modal grounding model that captures both global and local information across modalities. Unlike most existing multi-modal models, which focus mainly on global information, LEGO can attend to the fine-grained details of its input, making it applicable to a wider range of tasks. The paper demonstrates LEGO on image grounding, video grounding, sound localization, and multi-modal understanding, showcasing its ability to effectively ground and understand inputs in each of these modalities.
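
To make "grounding" concrete: a grounding model answers a query not only with text but with localized references, such as normalized bounding-box coordinates for an image region or a start/end time span for a video or audio segment. The sketch below is illustrative only; the response strings, the bracket/brace conventions, and the parsing helpers are assumptions for demonstration, not LEGO's actual output format.

```python
import re

# Hypothetical grounded responses of the kind a grounding model might emit
# (formats are assumptions, not taken from the LEGO paper).
IMAGE_RESPONSE = "The dog is lying on the sofa [0.12, 0.40, 0.55, 0.93]."
VIDEO_RESPONSE = "The goal is scored between {12.5} and {15.0} seconds."

def parse_box(response: str, width: int, height: int) -> tuple[int, int, int, int]:
    """Extract a normalized [x1, y1, x2, y2] box and scale it to pixel coordinates."""
    coords = re.search(r"\[([^\]]+)\]", response).group(1)
    x1, y1, x2, y2 = map(float, coords.split(","))
    return (int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height))

def parse_span(response: str) -> tuple[float, float]:
    """Extract a {start} and {end} timestamp pair, in seconds."""
    start, end = re.findall(r"\{([\d.]+)\}", response)
    return float(start), float(end)

print(parse_box(IMAGE_RESPONSE, width=1280, height=720))  # (153, 288, 704, 669)
print(parse_span(VIDEO_RESPONSE))                         # (12.5, 15.0)
```

The point of the example is the contrast with global-only models: a caption alone ("a dog on a sofa") carries no spatial or temporal reference, whereas a grounded response pins the answer to a specific region or interval.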

Publication date: 12 Jan 2024
Project Page: https://lzw-lzw.github.io/LEGO.github.io/
Paper: https://arxiv.org/pdf/2401.06071