The article presents AiluRus, a method to speed up Vision Transformers (ViTs) on dense prediction tasks such as object detection and semantic segmentation. The approach reduces ViT complexity by applying adaptive resolution to different regions of an image according to their importance. A spatial-aware density-based clustering algorithm selects representative tokens from the token sequence; less informative tokens are then merged into these representatives to form low-resolution regions, while tokens in critical regions are kept at high resolution. This substantially shortens the token sequence, so subsequent layers process fewer tokens and run faster. The method reports promising results, increasing a ViT's throughput by 48% FPS without any fine-tuning and cutting training time by 52%.
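The core idea of selecting representative tokens and merging the rest into them can be sketched in a toy form. This is a simplified illustration, not the paper's actual algorithm: the function name, the k-nearest-neighbour density heuristic, and the nearest-anchor average-merge are all assumptions made for demonstration.

```python
import numpy as np

def merge_tokens(tokens, num_anchors, k=5):
    """Toy sketch: pick high-density tokens as anchors, merge the rest into them.

    tokens: (n, d) array of ViT token embeddings.
    Returns (num_anchors, d) merged tokens, the anchor indices, and the assignment.
    """
    n, d = tokens.shape
    # Pairwise Euclidean distances between all tokens.
    dists = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    # Density estimate: inverse of the mean distance to the k nearest neighbours
    # (a stand-in for the paper's spatial-aware density-based clustering).
    knn = np.sort(dists, axis=1)[:, 1 : k + 1]
    density = 1.0 / (knn.mean(axis=1) + 1e-8)
    anchors = np.argsort(-density)[:num_anchors]
    # Assign every token to its nearest anchor and average-merge each cluster.
    # Each anchor is at distance 0 from itself, so no cluster is empty.
    assign = np.argmin(dists[:, anchors], axis=1)
    merged = np.stack([tokens[assign == i].mean(axis=0) for i in range(num_anchors)])
    return merged, anchors, assign

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 32))  # e.g. a 14x14 ViT token grid, dim 32
merged, anchors, assign = merge_tokens(tokens, num_anchors=49)
print(merged.shape)  # the sequence shrinks from 196 tokens to 49
```

Shrinking the sequence this way is where the speedup comes from: self-attention cost is quadratic in sequence length, so a 4x reduction in tokens cuts attention FLOPs by roughly 16x in the layers that follow.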
Publication date: 2 Nov 2023
Project Page: https://github.com/caddyless/ailurus/tree/main
Paper: https://arxiv.org/pdf/2311.01197