This research paper identifies artifacts in the feature maps of both supervised and self-supervised Vision Transformer (ViT) networks. These artifacts are high-norm tokens that appear mainly in low-informative background areas of images and are repurposed by the model for internal computations. The researchers propose a fix: appending additional learnable tokens (registers) to the input sequence of the ViT. This has been shown to eliminate the problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, and enables object discovery methods with larger models.
Publication date: 28 Sep 2023
arXiv Abstract: https://arxiv.org/abs/2309.16588
Paper: https://arxiv.org/pdf/2309.16588
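To make the proposed fix concrete, below is a minimal PyTorch sketch of a ViT encoder whose input sequence is extended with learnable register tokens. This is not the authors' implementation; the class name, hyperparameters, and use of `nn.TransformerEncoder` are illustrative assumptions. The key idea from the paper is simply that extra tokens join the sequence, attend like any other token, and are discarded from the output.

```python
import torch
import torch.nn as nn


class ViTWithRegisters(nn.Module):
    """Sketch of a ViT with register tokens appended to the input sequence.

    Assumptions (not from the paper): module structure, default sizes,
    and the use of PyTorch's built-in TransformerEncoder blocks.
    """

    def __init__(self, img_size=224, patch_size=16, dim=384,
                 depth=12, heads=6, num_registers=4):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2

        # Standard ViT components: patch embedding, [CLS] token, positional embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Extra learnable register tokens: they participate in attention
        # (giving the model a place for internal computations) but are
        # dropped from the output features.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.num_registers = num_registers

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=depth)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, N, dim)
        x = torch.cat([self.cls_token.expand(B, -1, -1), x], 1)   # prepend [CLS]
        x = x + self.pos_embed                                    # positions for [CLS] + patches

        # Append registers after positional embeddings, so they carry
        # no spatial position of their own.
        x = torch.cat([x, self.registers.expand(B, -1, -1)], 1)

        x = self.blocks(x)

        # Discard register tokens before returning features.
        cls_feat = x[:, 0]
        patch_feats = x[:, 1:-self.num_registers]
        return cls_feat, patch_feats


# Usage: the patch feature map keeps its full spatial resolution, and the
# registers give the model somewhere other than background patches to store
# global computations.
model = ViTWithRegisters()
cls_feat, patch_feats = model(torch.randn(2, 3, 224, 224))
print(cls_feat.shape, patch_feats.shape)  # (2, 384) and (2, 196, 384)
```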