This paper introduces CREMA, an efficient modality-fusion framework designed to improve video reasoning. By leveraging existing pre-trained models, it incorporates multiple informative modalities from video, such as optical flow, 3D point clouds, and audio. CREMA introduces a query transformer that employs parameter-efficient modules associated with each available modality; these modules project diverse modality features into the LLM token-embedding space, enabling the model to integrate different data types for response generation. The paper also proposes a fusion module that compresses the multimodal queries, keeping the LLM computationally efficient even as additional modalities are combined. CREMA achieves better or equivalent performance compared with other multimodal LLMs while using 96% fewer trainable parameters.
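
To make the architecture concrete, here is a minimal PyTorch sketch of the idea described above; it is not the authors' implementation. The class names (`ModalityProjector`, `MultimodalFusion`), feature dimensions, 32-token query budget, and the attention-pooling fusion are all illustrative assumptions standing in for CREMA's modality-specific, parameter-efficient query modules and its query-compressing fusion module.

```python
# Sketch only: per-modality lightweight projectors plus a fusion step that
# compresses the concatenated queries to a fixed token budget before they
# are handed to a frozen LLM. Sizes and fusion mechanism are assumptions.
import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Projects one modality's features into the LLM token-embedding space."""

    def __init__(self, feat_dim: int, llm_dim: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(feat_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim) -> (batch, num_queries, llm_dim)
        kv = self.kv_proj(feats)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out


class MultimodalFusion(nn.Module):
    """Compresses concatenated per-modality queries to a fixed token budget."""

    def __init__(self, llm_dim: int, num_fused: int = 32):
        super().__init__()
        self.fused_queries = nn.Parameter(torch.randn(num_fused, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, all_queries: torch.Tensor) -> torch.Tensor:
        # all_queries: (batch, num_modalities * num_queries, llm_dim)
        q = self.fused_queries.unsqueeze(0).expand(all_queries.size(0), -1, -1)
        fused, _ = self.attn(q, all_queries, all_queries)
        return fused  # (batch, num_fused, llm_dim), fed to the frozen LLM


# Usage example with made-up feature widths for RGB video, optical flow, audio.
llm_dim = 2048
projectors = nn.ModuleDict({
    "rgb": ModalityProjector(1408, llm_dim),
    "flow": ModalityProjector(1024, llm_dim),
    "audio": ModalityProjector(768, llm_dim),
})
fusion = MultimodalFusion(llm_dim)

feats = {
    "rgb": torch.randn(2, 256, 1408),
    "flow": torch.randn(2, 256, 1024),
    "audio": torch.randn(2, 128, 768),
}
per_modality = [projectors[m](x) for m, x in feats.items()]
llm_tokens = fusion(torch.cat(per_modality, dim=1))
print(llm_tokens.shape)  # torch.Size([2, 32, 2048])
```

The design point the sketch tries to capture is that only the small per-modality projectors and the fusion step would be trained, while the backbone LLM and the pre-trained feature extractors stay frozen, which is where the claimed reduction in trainable parameters comes from.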
Publication date: 8 Feb 2024
Project Page: https://CREMA-VideoLLM.github.io/
Paper: https://arxiv.org/pdf/2402.05889