This paper introduces DECOLA, an open-vocabulary detection framework that leverages both image-level labels and detailed detection annotations. Training proceeds in three phases: (1) train a language-conditioned object detector on detection annotations, (2) use that detector to pseudo-label weakly labeled images from their image-level tags, and (3) train an unconditioned open-vocabulary detector on the pseudo-annotated images. DECOLA performs strongly in zero-shot settings, outperforming previous approaches because its conditioning mechanism yields more accurate pseudo-labels. It achieves state-of-the-art results across a range of model sizes, architectures, and datasets.
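The three-phase pipeline can be sketched as follows. This is a toy illustration only: the function names, data layout, and "training" stand-ins are hypothetical and greatly simplified; the actual DECOLA training loops are far more involved.

```python
# Toy sketch of the three-phase pipeline (all names and structures are
# illustrative, not DECOLA's real API).

def train_conditioned_detector(detection_data):
    """Phase 1 (sketch): fit a detector conditioned on a class name.

    Returns a callable that, given an image and a class name, proposes
    boxes only for that class.
    """
    def detector(image, class_name):
        # Placeholder: return regions whose label matches the queried class.
        return [(box, label) for box, label in image["regions"]
                if label == class_name]
    return detector

def pseudo_label(detector, weak_images):
    """Phase 2 (sketch): condition on each image-level tag to mine boxes."""
    labeled = []
    for image in weak_images:
        boxes = []
        for class_name in image["tags"]:  # image-level labels only
            boxes.extend(detector(image, class_name))
        labeled.append({"regions": boxes})
    return labeled

def train_open_vocab_detector(detection_data, pseudo_data):
    """Phase 3 (sketch): train an unconditioned detector on both sources."""
    return detection_data + pseudo_data  # stand-in for an actual training run

# Toy data: one fully annotated image, one image carrying only tags.
det_data = [{"regions": [((0, 0, 10, 10), "cat")]}]
weak = [{"regions": [((1, 1, 5, 5), "dog"), ((2, 2, 6, 6), "bird")],
         "tags": ["dog"]}]

detector = train_conditioned_detector(det_data)
pseudo = pseudo_label(detector, weak)
final_training_set = train_open_vocab_detector(det_data, pseudo)
```

Note how conditioning restricts pseudo-labels to the classes actually tagged in each image, which is the intuition behind DECOLA's more accurate pseudo-annotations.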
Publication date: 29 Nov 2023
Project Page: https://github.com/janghyuncho/DECOLA
Paper: https://arxiv.org/pdf/2311.17902