This research paper identifies artifacts in the feature maps of both supervised and self-supervised Vision Transformer (ViT) networks: high-norm tokens that appear primarily in low-informative background regions of images and are repurposed by the model for internal computations. The researchers propose a simple fix, appending additional dedicated tokens (registers) to the ViT input sequence, which is shown to eliminate the problem entirely for both supervised and self-supervised models. The fix also sets a new state of the art for self-supervised visual models on dense visual prediction tasks and enables object discovery methods to scale to larger models.
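To make the idea concrete, the sketch below shows one way a few extra learnable tokens could be appended to a ViT's patch-token sequence and then discarded at the output. This is a minimal PyTorch illustration under assumed names (`ViTWithRegisters`, `num_registers`, and the simplified encoder), not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Minimal sketch of a ViT with extra "register" tokens (assumed names, not the paper's code)."""

    def __init__(self, embed_dim=768, num_registers=4, num_patches=196, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # 16x16 patchify
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Extra learnable tokens that can absorb global/internal computations,
        # so high-norm artifact tokens need not appear among the patch tokens.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):
        b = images.shape[0]
        x = self.patch_embed(images).flatten(2).transpose(1, 2)        # (B, N, D) patch tokens
        x = torch.cat([self.cls_token.expand(b, -1, -1), x], dim=1)    # prepend [CLS]
        x = x + self.pos_embed
        # Append registers after positional embedding; they carry no spatial position.
        x = torch.cat([x, self.registers.expand(b, -1, -1)], dim=1)
        x = self.blocks(x)
        # Drop the registers at the output; keep [CLS] + patch tokens as usual.
        keep = self.pos_embed.shape[1]
        return x[:, :keep]

# Usage: for 224x224 inputs this yields 196 patch tokens plus [CLS], with 4 registers discarded.
# feats = ViTWithRegisters()(torch.randn(2, 3, 224, 224))  # -> (2, 197, 768)
```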
Publication date: 28 Sep 2023
Project Page: https://arxiv.org/abs/2309.16588
Paper: https://arxiv.org/pdf/2309.16588