The paper introduces a new method for self-supervised video object segmentation (VOS) that leverages structural dependencies already present in DINO-pretrained Transformers. Rather than relying on auxiliary modalities or iterative slot attention, the approach uses a single spatio-temporal Transformer block to process frame-wise DINO features and establish robust spatio-temporal correspondences. Hierarchical clustering is then applied to produce object segmentation masks. The method demonstrates state-of-the-art results across multiple unsupervised VOS benchmarks and excels in complex multi-object video segmentation tasks.
Publication date: 29 Nov 2023
Project Page: https://github.com/shvdiwnkozbw/SSL-UVOS
Paper: https://arxiv.org/pdf/2311.17893
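
The pipeline described above can be summarized in three steps: extract per-frame DINO patch features, refine them jointly over space and time with one Transformer block, and cluster the refined tokens into object masks. The sketch below illustrates this structure only; the module name `SpatioTemporalBlock`, the function `segment_video`, the layer sizes, and the use of scikit-learn's `AgglomerativeClustering` are assumptions for illustration, and the block here is untrained, whereas the paper trains it with a self-supervised objective and uses its own clustering procedure.

```python
import torch
import torch.nn as nn
from sklearn.cluster import AgglomerativeClustering


class SpatioTemporalBlock(nn.Module):
    """One Transformer block attending jointly over space and time.
    Hypothetical configuration; the paper's exact settings may differ."""

    def __init__(self, dim=384, num_heads=6, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        # x: (B, T*H*W, C) -- frame-wise DINO patch tokens flattened across time,
        # so attention links every patch to every other patch in every frame.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x


def segment_video(dino_tokens, num_objects=3):
    """dino_tokens: (T, H, W, C) precomputed frame-wise DINO features.
    Returns per-frame cluster labels of shape (T, H, W).
    Note: the block is randomly initialized here (sketch only)."""
    T, H, W, C = dino_tokens.shape
    block = SpatioTemporalBlock(dim=C)
    x = dino_tokens.reshape(1, T * H * W, C)
    with torch.no_grad():
        refined = block(x).squeeze(0)  # tokens carrying spatio-temporal context
    # Hierarchical (agglomerative) clustering of refined tokens into objects.
    labels = AgglomerativeClustering(n_clusters=num_objects).fit_predict(
        refined.cpu().numpy()
    )
    return torch.from_numpy(labels).reshape(T, H, W)


if __name__ == "__main__":
    # Toy example: random tensors stand in for real DINO features
    # (T=4 frames, 14x14 patch grid, 384-dim tokens as in ViT-S/16).
    feats = torch.randn(4, 14, 14, 384)
    masks = segment_video(feats, num_objects=2)
    print(masks.shape)  # torch.Size([4, 14, 14])
```

The key design point the sketch tries to capture is that a single attention block over all patch tokens of all frames is enough to propagate correspondences through time, after which mask extraction reduces to grouping tokens rather than learning a dedicated segmentation head.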