This paper presents Contrastive Language-Image Mosaic (CLIM), a method for aligning region and text representations in open-vocabulary object detection. CLIM exploits large-scale image-text pairs by combining multiple images into a single mosaicked image. Each image in the mosaic is treated as a ‘pseudo region’, and the feature of each pseudo region is trained to be similar to its corresponding text embedding, so the model learns region-text alignment without expensive box annotations. Experimental results show that CLIM significantly improves open-vocabulary object detectors.
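
The idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`mosaic_boxes`, `region_text_loss`), the square `n x n` layout with equal-sized tiles, the symmetric InfoNCE-style loss, and the temperature value are all assumptions made for illustration. In a real detector, the region features would be pooled from the mosaic's feature map using the pseudo-region boxes, and the text embeddings would come from a text encoder.

```python
import math


def mosaic_boxes(n, tile_w, tile_h):
    """Return pseudo-region boxes (x0, y0, x1, y1) for an n x n mosaic.

    Assumes every image is resized to the same tile size, so each box
    is known from the layout alone and needs no human annotation.
    """
    return [
        (col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h)
        for row in range(n)
        for col in range(n)
    ]


def _normalize(vec):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    # Negative log-softmax of the target entry, computed stably.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def region_text_loss(region_feats, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pseudo region i should match caption i.

    The temperature of 0.07 is an illustrative assumption.
    """
    regions = [_normalize(v) for v in region_feats]
    texts = [_normalize(v) for v in text_embs]
    n = len(regions)
    # Similarity matrix: rows are pseudo regions, columns are captions.
    sim = [
        [sum(a * b for a, b in zip(regions[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    region_to_text = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_region = sum(
        _cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (region_to_text + text_to_region)
```

With this sketch, `mosaic_boxes(2, 224, 224)` yields four boxes that tile a 448 x 448 mosaic, and `region_text_loss` is small when each pseudo-region feature points in the same direction as its own caption embedding and large when the pairing is shuffled.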

 

Publication date: 19 Dec 2023
Project Page: https://github.com/wusize/CLIM
Paper: https://arxiv.org/pdf/2312.11376