The paper introduces CREMA, an efficient, modular modality-fusion framework for multimodal compositional video reasoning. The framework can incorporate any new modality into video reasoning, leveraging existing pre-trained models to augment a given video with multiple informative modalities. It uses a query transformer equipped with a parameter-efficient module for each accessible modality, plus a fusion module that compresses the resulting multimodal queries. Validated on a range of video reasoning tasks, the model improves performance while training far fewer parameters.
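To make the architecture concrete, below is a minimal sketch (not the authors' implementation; class names, dimensions, and the LoRA-style adapters are assumptions) of a shared query transformer with per-modality parameter-efficient modules and a fusion step that compresses the concatenated multimodal queries into a fixed-length sequence.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Lightweight low-rank adapter, one per modality (hypothetical stand-in
    for CREMA's parameter-efficient modality modules)."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))


class MultimodalQueryFusion(nn.Module):
    """Sketch: shared query transformer + per-modality adapters + a fusion
    module that compresses all modality queries into n_fused tokens."""
    def __init__(self, modalities, dim=768, n_queries=32, n_fused=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        # Shared cross-attention; the adapters are the only
        # modality-specific trainable pieces in this sketch.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.adapters = nn.ModuleDict({m: ModalityAdapter(dim) for m in modalities})
        # Fusion: learnable fused queries attend over all modality queries.
        self.fused_queries = nn.Parameter(torch.randn(n_fused, dim) * 0.02)
        self.fusion_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, features: dict) -> torch.Tensor:
        per_modality = []
        for name, feats in features.items():            # feats: (B, T, dim)
            q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
            q, _ = self.cross_attn(q, feats, feats)      # queries read modality tokens
            per_modality.append(self.adapters[name](q))  # modality-specific adaptation
        concat = torch.cat(per_modality, dim=1)          # (B, n_mod * n_queries, dim)
        fq = self.fused_queries.unsqueeze(0).expand(concat.size(0), -1, -1)
        fused, _ = self.fusion_attn(fq, concat, concat)  # compress to n_fused tokens
        return fused                                     # e.g. prefix for a frozen LLM


# Example usage with three hypothetical modality feature streams.
model = MultimodalQueryFusion(["video", "flow", "depth"])
feats = {m: torch.randn(2, 16, 768) for m in ["video", "flow", "depth"]}
print(model(feats).shape)  # torch.Size([2, 32, 768])
```

The point of the compression step is that the number of tokens handed to the downstream reasoning model stays fixed regardless of how many modalities are attached, which is what keeps adding new modalities cheap.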


Publication date: 8 Feb 2024
Project Page: https://CREMA-VideoLLM.github.io/
Paper: https://arxiv.org/pdf/2402.05889