The paper introduces a new self-supervised method for video object segmentation (VOS) that leverages the structural dependencies already present in DINO-pretrained Transformers. Rather than resorting to auxiliary modalities or iterative slot attention, the approach passes frame-wise DINO features through a single spatio-temporal Transformer block to establish robust spatio-temporal correspondences, and then applies hierarchical clustering to the resulting features to produce object segmentation masks. The method achieves top performance across multiple unsupervised VOS benchmarks and is particularly effective on complex multi-object video segmentation.
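
As a rough illustration of the pipeline described above, the sketch below wires together the three stages: frame-wise DINO patch features, a single spatio-temporal Transformer block over all tokens, and hierarchical clustering of the refined features into per-frame segment labels. The feature shapes, the random placeholder features, the use of `torch.nn.TransformerEncoderLayer`, and scikit-learn's `AgglomerativeClustering` are illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
# Minimal sketch of the described pipeline (illustrative, not the authors' code):
# frame-wise DINO patch features -> one spatio-temporal Transformer block
# -> hierarchical clustering of the refined tokens into object segments.
import torch
import torch.nn as nn
from sklearn.cluster import AgglomerativeClustering

T, H, W, D = 8, 14, 14, 384        # frames, patch grid, DINO ViT-S/16 feature dim (assumed)
N = T * H * W                      # total spatio-temporal tokens

# Placeholder for frame-wise DINO patch tokens, shape (T, H*W, D).
# In practice these would come from a frozen DINO ViT applied to each frame.
dino_feats = torch.randn(T, H * W, D)

# Single spatio-temporal Transformer block: attention runs jointly over the
# tokens of all frames, so correspondences across time are captured in one pass.
block = nn.TransformerEncoderLayer(d_model=D, nhead=6, batch_first=True)
tokens = dino_feats.reshape(1, N, D)               # (1, T*H*W, D)
refined = block(tokens).squeeze(0)                 # (T*H*W, D)

# Hierarchical (agglomerative) clustering of the refined token features;
# each cluster is treated as one object, yielding per-frame segment ids.
n_objects = 3                                      # assumed number of objects
labels = AgglomerativeClustering(n_clusters=n_objects).fit_predict(
    refined.detach().numpy()
)
masks = torch.from_numpy(labels).reshape(T, H, W)  # coarse per-frame masks at patch resolution
print(masks.shape)                                 # torch.Size([8, 14, 14])
```

In the actual method the clustering granularity and feature refinement are tuned for multi-object scenes; the sketch only shows how the stages compose.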


Publication date: 29 Nov 2023
Project Page: https://github.com/shvdiwnkozbw/SSL-UVOS
Paper: https://arxiv.org/pdf/2311.17893