The article introduces Point-VOS, a novel method for Video Object Segmentation (VOS). Traditional VOS methods require dense per-object mask annotations, which are time-consuming and costly to produce. Point-VOS instead uses a spatio-temporally sparse point-wise annotation scheme, significantly reducing the annotation effort. The authors apply this scheme to two large-scale video datasets and propose a new Point-VOS benchmark. The study shows that existing VOS methods can be adapted to train on point annotations while still achieving results close to fully-supervised performance (a minimal sketch of such point-supervised training appears below). The data can also be used to improve models that connect vision and language, as demonstrated by an evaluation on the Video Narrative Grounding (VNG) task.
Publication date: 8 Feb 2024
Project Page: https://pointvos.github.io
Paper: https://arxiv.org/pdf/2402.05917
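To make the training idea concrete, here is a minimal sketch (not the authors' implementation) of how a VOS model can be supervised from sparse points: the usual dense segmentation loss is replaced by a cross-entropy term evaluated only at the annotated pixel locations. The function and tensor names (`point_supervised_loss`, `points`, `labels`) are illustrative assumptions, not names from the paper.

```python
# Sketch of point-supervised training: compute the segmentation loss
# only at sparsely annotated points instead of over a dense mask.
import torch
import torch.nn.functional as F

def point_supervised_loss(logits: torch.Tensor,
                          points: torch.Tensor,
                          labels: torch.Tensor) -> torch.Tensor:
    """
    logits: (B, C, H, W) per-pixel class scores from any VOS model.
    points: (B, N, 2) integer (y, x) coordinates of annotated points.
    labels: (B, N) long tensor with the class id of each point
            (e.g. object id, or 0 for background).
    """
    B, C, H, W = logits.shape
    # Batch indices broadcast against the N points of each sample.
    batch_idx = torch.arange(B, device=logits.device).unsqueeze(1).expand_as(labels)
    # Gather logits at the annotated locations; result has shape (B, N, C).
    point_logits = logits[batch_idx, :, points[..., 0], points[..., 1]]
    # Standard cross-entropy, restricted to the sparse point annotations.
    return F.cross_entropy(point_logits.reshape(-1, C), labels.reshape(-1))
```

In an existing VOS training loop, a loss like this could simply replace the dense mask loss, which is the sense in which the paper reports that current methods can be adapted to point annotations.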