Memory Consolidation Enables Long-Context Video Understanding
The article introduces a novel approach to long-context video understanding using a memory-consolidated vision transformer (MC-ViT). Standard transformer-based video encoders struggle with long contexts because the cost of self-attention grows quadratically with sequence length. The researchers…