The article introduces a novel approach to long-context video understanding using a memory-consolidated vision transformer (MC-ViT). Standard transformer-based video encoders struggle with long contexts because the cost of self-attention grows quadratically with sequence length. The researchers instead fine-tune pretrained video transformers to attend to a compact, non-parametric memory consolidated from their past activations, extending the temporal context they can reason over at little additional cost. This method sets a new state of the art in long-context video understanding on EgoSchema, Perception Test, and Diving48, outperforming methods with significantly more parameters.
Publication date: 8 Feb 2024
arXiv page: https://arxiv.org/abs/2402.05861
Paper: https://arxiv.org/pdf/2402.05861
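To make the core idea concrete, here is a minimal sketch (not the authors' code) of a memory-consolidated attention block: a long video is processed chunk by chunk, queries come from the current chunk, and keys/values additionally include a consolidated memory of past activations. The class name, dimensions, and the random-subsampling consolidation rule are illustrative assumptions; the paper's exact consolidation strategy and training setup may differ.

```python
import torch
import torch.nn as nn


class MemoryConsolidatedAttention(nn.Module):
    """Illustrative streaming attention block with a consolidated memory."""

    def __init__(self, dim: int, num_heads: int = 8, mem_size: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mem_size = mem_size   # max consolidated tokens kept per chunk
        self.memory = None         # (B, M, dim) consolidated past activations

    def consolidate(self, x: torch.Tensor) -> torch.Tensor:
        # Compress a chunk's activations into a small set of memory tokens.
        # Random subsampling is used here purely for illustration; clustering
        # the tokens (e.g. k-means centroids) is another non-parametric option.
        B, N, D = x.shape
        idx = torch.randperm(N, device=x.device)[: self.mem_size]
        return x[:, idx].detach()  # no gradients flow into the stored memory

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Queries: current chunk. Keys/values: consolidated memory + current chunk.
        kv = x if self.memory is None else torch.cat([self.memory, x], dim=1)
        out, _ = self.attn(query=x, key=kv, value=kv)
        # Append a consolidated summary of this chunk to the memory.
        new_mem = self.consolidate(out)
        self.memory = new_mem if self.memory is None else torch.cat(
            [self.memory, new_mem], dim=1)
        return out


# Usage: stream a long video through the block one chunk at a time.
if __name__ == "__main__":
    block = MemoryConsolidatedAttention(dim=256)
    video_chunks = [torch.randn(1, 196, 256) for _ in range(4)]  # 4 token chunks
    for chunk in video_chunks:
        features = block(chunk)
```

Because the memory holds only a small, consolidated subset of past activations rather than every past token, attention cost per chunk stays far below that of full self-attention over the entire video, which is what lets a pretrained encoder be fine-tuned to handle much longer contexts.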