AiluRus: A Scalable ViT Framework for Dense Prediction

The paper presents AiluRus, a method for speeding up Vision Transformers (ViTs) on dense prediction tasks such as object detection and semantic segmentation. The approach reduces the computational cost of ViTs by applying adaptive resolution to different regions of an image according to their importance. A spatial-aware density-based clustering algorithm selects representative tokens from the token sequence. Tokens close to these representatives are merged to form low-resolution regions, while the remaining tokens are preserved as high-resolution regions. This significantly reduces the number of tokens, so subsequent layers process a shorter sequence and run faster. The authors report that the method accelerates a ViT model by 48% in frames per second (FPS) without fine-tuning and saves 52% of training time.
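
To make the merging idea above concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`merge_tokens`, `unfold`), the density-peaks style anchor score, and the parameters `k`, `spatial_weight`, and `merge_radius` are all assumptions chosen for illustration. It picks a few anchor tokens using a distance that mixes feature and grid distance, averages the tokens lying near each anchor into one token, and keeps every other token unchanged.

```python
import numpy as np


def merge_tokens(tokens, grid_hw, num_anchors, k=4,
                 spatial_weight=1.0, merge_radius=1.5):
    """Toy token merging; every name and default here is illustrative."""
    h, w = grid_hw
    n = tokens.shape[0]
    assert n == h * w, "expects one token per grid cell"

    # Grid position (row, col) of every token.
    rows, cols = np.divmod(np.arange(n), w)
    pos = np.stack([rows, cols], axis=1).astype(float)

    # Spatial-aware distance: feature distance plus weighted grid distance.
    feat_d = np.linalg.norm(tokens[:, None] - tokens[None], axis=-1)
    spat_d = np.linalg.norm(pos[:, None] - pos[None], axis=-1)
    dist = feat_d + spatial_weight * spat_d
    dist = dist / dist.max()

    # Local density from the k nearest neighbours (column 0 is the token itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))

    # Distance to the closest token with a higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()

    # Anchors: dense tokens that are far from any denser token.
    anchors = np.argsort(-(density * delta))[:num_anchors]

    # A token joins the closest anchor within merge_radius on the grid;
    # tokens with no anchor that close stay on their own (high resolution).
    d_anchor = np.where(spat_d[:, anchors] <= merge_radius,
                        dist[:, anchors], np.inf)
    has_anchor = np.isfinite(d_anchor).any(axis=1)
    nearest = anchors[np.argmin(d_anchor, axis=1)]
    assign = np.where(has_anchor, nearest, np.arange(n))

    # Average the tokens in each group to build the reduced sequence.
    groups, inverse = np.unique(assign, return_inverse=True)
    merged = np.zeros((len(groups), tokens.shape[1]))
    np.add.at(merged, inverse, tokens)
    merged /= np.bincount(inverse, minlength=len(groups))[:, None]
    return merged, inverse


def unfold(merged, inverse):
    """Copy each merged token back to its original grid positions."""
    return merged[inverse]


rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))            # an 8x8 grid of 16-dim tokens
merged, inverse = merge_tokens(tokens, (8, 8), num_anchors=4)
print(tokens.shape[0], "tokens reduced to", merged.shape[0])
print("restored shape:", unfold(merged, inverse).shape)
```

In this sketch the layers after the merge would operate on `merged`, and `unfold` restores a full-resolution feature map for the prediction head by copying each merged token back to the positions it came from.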

Publication date: 2 Nov 2023
Project Page: https://github.com/caddyless/ailurus/tree/main
Paper: https://arxiv.org/pdf/2311.01197
