EgoVLPv2 is the second generation of egocentric video-language pre-training, a significant improvement over its predecessor, EgoVLP. It incorporates cross-modal fusion directly into the video and language backbones, learning a strong video-text representation during pre-training and reusing the cross-modal attention modules to support different downstream tasks flexibly and efficiently. This reduces fine-tuning costs and makes the system more lightweight and compute-efficient. EgoVLPv2 achieves consistent state-of-the-art performance over strong baselines across all downstream tasks.
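
The key architectural idea is that the same backbone layers can run with cross-modal attention switched on (acting as a fusion encoder) or off (acting as a fast unimodal encoder). Below is a minimal PyTorch sketch of such a switchable block; the class name `SwitchableFusionBlock`, the dimensions, and the exact layer layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of "fusion in the backbone":
# a transformer block whose cross-modal attention can be toggled, so one
# backbone serves both as a unimodal encoder and as a fusion encoder.
import torch
import torch.nn as nn


class SwitchableFusionBlock(nn.Module):
    """Transformer block with optional cross-modal attention.

    With fuse=False it behaves like a standard self-attention block;
    with fuse=True it additionally attends to the other modality's tokens.
    Widths and layer ordering here are illustrative assumptions.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, other=None, fuse=False):
        # Standard self-attention over this modality's tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention into the other modality, only in fusion mode.
        if fuse and other is not None:
            h = self.norm2(x)
            x = x + self.cross_attn(h, other, other, need_weights=False)[0]
        # Feed-forward sublayer with residual connection.
        return x + self.mlp(self.norm3(x))


# Toy usage: video and text token sequences of matching width.
video = torch.randn(2, 196, 768)  # (batch, video tokens, dim)
text = torch.randn(2, 32, 768)    # (batch, text tokens, dim)
block = SwitchableFusionBlock()

v_uni = block(video, fuse=False)                # dual-encoder mode
v_fused = block(video, other=text, fuse=True)   # fusion-encoder mode
print(v_uni.shape, v_fused.shape)               # both torch.Size([2, 196, 768])
```

Because the cross-attention modules live inside the backbone rather than in a separate fusion head, downstream tasks that need joint video-text reasoning can reuse them directly, while retrieval-style tasks can skip them entirely.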

Publication date: July 11, 2023
Project Page: https://shramanpramanick.github.io/EgoVLPv2/
Paper: https://arxiv.org/pdf/2307.05463.pdf