Distilling Vision-Language Models on Millions of Videos
The research aims to replicate the success of image-text data for video-language models. The researchers fine-tuned a video-language model from a strong image-language baseline with synthesized instructional data. The adapted…
Continue reading