The paper proposes CREMA, an efficient and modular modality-fusion framework for video reasoning. It improves the flexibility and efficiency of multimodal compositional reasoning by allowing new modalities to be injected into video reasoning. Each modality is handled by a parameter-efficient module that projects its features into the LLM token-embedding space, so diverse data types can be integrated for response generation. A fusion module then compresses the multimodal queries, keeping the LLM's computation efficient as modalities are added. CREMA is validated on video-3D, video-audio, and video-language reasoning tasks, matching or surpassing strong multimodal LLMs while using 96% fewer trainable parameters.
Publication date: 8 Feb 2024
Project Page: https://CREMA-VideoLLM.github.io/
Paper: https://arxiv.org/pdf/2402.05889
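
To make the design more concrete, the PyTorch sketch below illustrates the two pieces the summary describes: a per-modality parameter-efficient query module that projects frozen-encoder features into the LLM token-embedding space, and a fusion module that compresses the combined query tokens to a fixed budget. Class names, dimensions, and the cross-attention details are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of CREMA-style modular modality fusion (assumed names/shapes).
import torch
import torch.nn as nn


class ModalityQueryModule(nn.Module):
    """Parameter-efficient adapter: maps one modality's features (e.g. video,
    audio, 3D) to a small set of query tokens in the LLM embedding space."""

    def __init__(self, feat_dim: int, llm_dim: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq, feat_dim) from a frozen modality encoder
        kv = self.proj(feats)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)           # (batch, num_queries, llm_dim)
        return out


class QueryFusionModule(nn.Module):
    """Compresses concatenated multimodal query tokens to a fixed budget so
    the LLM's input length stays constant as modalities are added."""

    def __init__(self, llm_dim: int, num_fused: int = 32):
        super().__init__()
        self.fused_queries = nn.Parameter(torch.randn(num_fused, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, modality_tokens: list[torch.Tensor]) -> torch.Tensor:
        kv = torch.cat(modality_tokens, dim=1)   # all modalities' query tokens
        q = self.fused_queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)          # (batch, num_fused, llm_dim)
        return fused


# Usage: project video and audio features, fuse, then prepend the fused tokens
# to the frozen LLM's text embeddings for response generation (LLM omitted).
video_adapter = ModalityQueryModule(feat_dim=1408, llm_dim=2048)
audio_adapter = ModalityQueryModule(feat_dim=768, llm_dim=2048)
fusion = QueryFusionModule(llm_dim=2048)

video_feats = torch.randn(2, 257, 1408)   # dummy frozen-encoder outputs
audio_feats = torch.randn(2, 64, 768)
fused_tokens = fusion([video_adapter(video_feats), audio_adapter(audio_feats)])
print(fused_tokens.shape)                  # torch.Size([2, 32, 2048])
```

In this sketch only the small adapters and the fusion module would be trained, which is consistent with the paper's claim of large savings in trainable parameters relative to full multimodal LLMs.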