The article presents DE-ViT, a new open-set object detector built on the vision-only DINOv2 backbone. Instead of relying on language, DE-ViT learns new categories from example images. The authors reformulate multi-class classification as a set of binary tasks, bypassing per-class inference, and propose a new region-propagation technique for localization. DE-ViT outperforms the state of the art on open-vocabulary, few-shot, and one-shot object detection benchmarks on COCO and LVIS.

Publication date: 22 Sep 2023
Project Page: https://github.com/mlzxy/devit
Paper: https://arxiv.org/pdf/2309.12969