This article presents LT-ViT, a Vision Transformer (ViT) for multi-label chest X-ray (CXR) classification. Unlike previous ViTs, LT-ViT aggregates information from multiple scales, improving vision-only training for CXRs. The model uses combined attention between image tokens and auxiliary tokens that represent the class labels. The study found that LT-ViT outperforms existing pure ViTs on two CXR datasets, generalizes to other pre-training methods, and enables model interpretability without relying on Grad-CAM or its variants.
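The core idea of attending between image tokens and label tokens can be sketched in a few lines of PyTorch. This is an illustrative toy model, not the authors' implementation: the class `LabelTokenViTSketch`, its dimensions, and the single-logit readout per label token are all assumptions chosen for brevity. Learnable label tokens are concatenated with the patch embeddings, the joint sequence passes through self-attention layers, and each label token then produces the logit for its class:

```python
import torch
import torch.nn as nn

class LabelTokenViTSketch(nn.Module):
    """Toy sketch (not the paper's code): image tokens plus learnable
    label tokens share self-attention; each label token yields one logit."""
    def __init__(self, dim=64, num_labels=14, num_heads=4, depth=2):
        super().__init__()
        # One learnable token per class label (e.g. 14 CheXpert-style findings)
        self.label_tokens = nn.Parameter(torch.randn(1, num_labels, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)  # per-label logit readout

    def forward(self, image_tokens):              # (B, N, dim) patch embeddings
        b = image_tokens.size(0)
        lab = self.label_tokens.expand(b, -1, -1)
        x = torch.cat([lab, image_tokens], dim=1)  # joint attention over both
        x = self.encoder(x)
        lab_out = x[:, :lab.size(1)]               # label tokens after attention
        return self.head(lab_out).squeeze(-1)      # (B, num_labels) logits

model = LabelTokenViTSketch()
logits = model(torch.randn(2, 49, 64))  # e.g. a 7x7 patch grid, batch of 2
print(logits.shape)                     # torch.Size([2, 14])
```

Because each label token attends to the image tokens directly, its attention weights over patches can be inspected for interpretability, which is why such a design can avoid Grad-CAM-style post-hoc methods.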
Publication date: 14 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.07263