This work aims to replicate, for video-language models, the success that large-scale image-text data brought to image-language models. The researchers adapt a video-language model from a strong image-language baseline by fine-tuning it on synthesized instructional data. The adapted model is then used to auto-label millions of videos with high-quality captions. It performs well across a range of video-language benchmarks, surpassing prior results, and it generates detailed descriptions for previously unseen videos, providing better textual supervision than existing caption sources. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions outperforms the strongest baseline that also leverages vision-language models.
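To make the last step concrete, below is a minimal sketch (not the authors' code) of how a dual-encoder model could be contrastively trained on (video, auto-generated caption) pairs. The encoder names, feature dimensions, and toy data are assumptions for illustration; the projection heads stand in for the full video and text towers used in the paper.

```python
# Hypothetical sketch: CLIP-style contrastive training of a video-text dual encoder
# on pre-extracted features of videos and their auto-generated captions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, embed_dim=512):
        super().__init__()
        # Placeholder projection heads; real encoders would be full video/text backbones.
        self.video_proj = nn.Sequential(
            nn.Linear(video_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim))
        # Learnable temperature, initialized near log(1/0.07) as in CLIP-style training.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def contrastive_loss(v, t, logit_scale):
    # Symmetric InfoNCE: matched (video, caption) pairs lie on the diagonal.
    logits = logit_scale.exp() * v @ t.T
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Toy training step on random features standing in for a batch of videos
# paired with their auto-generated captions.
model = DualEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
video_feats = torch.randn(32, 1024)   # placeholder video features
caption_feats = torch.randn(32, 768)  # placeholder caption features
v, t = model(video_feats, caption_feats)
loss = contrastive_loss(v, t, model.logit_scale)
loss.backward()
optimizer.step()
```

The same loss rewards the encoder for pulling each video toward its own auto-generated caption and pushing it away from the other captions in the batch, which is how richer captions translate into stronger retrieval performance.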

Publication date: 12 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.06129