The paper presents PlainSeg, a minimalist and efficient semantic segmentation model built on plain Vision Transformers (ViTs). The model aims for high performance with a simple structure: transformer layers plus just three 3×3 convolutions. The study reports two findings: high-resolution features are important for strong performance, and slim transformer decoders need a larger learning rate. The authors also present PlainSeg-Hier, a variant that uses hierarchical features. Experiments on four benchmarks show that both models are effective and efficient for semantic segmentation.
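
The learning-rate finding can be illustrated with a small sketch of how one might give decoder parameters a larger learning rate than the backbone. This is not the authors' training code: the function name `build_param_groups`, the `decoder.` name prefix, and the values `base_lr=1e-4` and `decoder_lr_mult=10.0` are all hypothetical choices made for illustration, and the summary above does not state the actual multiplier.

```python
def build_param_groups(param_names, base_lr=1e-4, decoder_lr_mult=10.0,
                       decoder_prefix="decoder."):
    """Split parameter names into backbone and decoder groups.

    All defaults are illustrative assumptions, not values from the paper.
    """
    backbone, decoder = [], []
    for name in param_names:
        # Parameters whose name starts with the (assumed) prefix go to the decoder group.
        if name.startswith(decoder_prefix):
            decoder.append(name)
        else:
            backbone.append(name)
    return [
        {"name": "backbone", "params": backbone, "lr": base_lr},
        # The decoder group gets the base learning rate scaled by the multiplier.
        {"name": "decoder", "params": decoder, "lr": base_lr * decoder_lr_mult},
    ]


if __name__ == "__main__":
    names = [
        "backbone.patch_embed.weight",   # hypothetical parameter names
        "decoder.layers.0.weight",
        "decoder.head.bias",
    ]
    for group in build_param_groups(names):
        print(group["name"], group["lr"], group["params"])
```

Here the groups hold parameter names rather than tensors so the sketch runs without a deep learning framework. In PyTorch, per-parameter options follow a similar shape: a list of dicts with a `params` entry and an optional `lr` override passed to the optimizer.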

 

Publication date: 19 Oct 2023
Project Page: https://github.com/ydhongHIT/PlainSeg
Paper: https://arxiv.org/pdf/2310.12755