LEGO is a multi-modal grounding model that captures both global and local information across modalities. Unlike most existing multi-modal models, which focus mainly on global information, LEGO can attend to the fine-grained details of its input, making it applicable to a wider range of tasks. The paper demonstrates LEGO on image grounding, video grounding, sound localization, and multi-modal understanding, showcasing its ability to effectively ground and understand inputs in each of these modalities.
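
To make "grounding" concrete: a grounding model answers a query not only with text but with localized references, such as normalized bounding-box coordinates for an image region or a start/end time span for a video or audio segment. The sketch below is illustrative only; the response strings, the bracket/brace conventions, and the parsing helpers are assumptions for demonstration, not LEGO's actual output format.

```python
import re

# Hypothetical grounded responses of the kind a grounding model might emit
# (formats are assumptions, not taken from the LEGO paper).
IMAGE_RESPONSE = "The dog is lying on the sofa [0.12, 0.40, 0.55, 0.93]."
VIDEO_RESPONSE = "The goal is scored between {12.5} and {15.0} seconds."

def parse_box(response: str, width: int, height: int) -> tuple[int, int, int, int]:
    """Extract a normalized [x1, y1, x2, y2] box and scale it to pixel coordinates."""
    coords = re.search(r"\[([^\]]+)\]", response).group(1)
    x1, y1, x2, y2 = map(float, coords.split(","))
    return (int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height))

def parse_span(response: str) -> tuple[float, float]:
    """Extract a {start} and {end} timestamp pair, in seconds."""
    start, end = re.findall(r"\{([\d.]+)\}", response)
    return float(start), float(end)

print(parse_box(IMAGE_RESPONSE, width=1280, height=720))  # (153, 288, 704, 669)
print(parse_span(VIDEO_RESPONSE))                         # (12.5, 15.0)
```

The point of the example is the contrast with global-only models: a caption alone ("a dog on a sofa") carries no spatial or temporal reference, whereas a grounded response pins the answer to a specific region or interval.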

Publication date: 12 Jan 2024
Project Page: https://lzw-lzw.github.io/LEGO.github.io/
Paper: https://arxiv.org/pdf/2401.06071