The paper proposes CREMA, an efficient and modular modality-fusion framework for video reasoning. CREMA makes compositional multimodal reasoning more flexible and efficient by allowing new modalities (such as 3D and audio inputs) to be injected into a video reasoning model. For each modality, a parameter-efficient module projects that modality's features into the LLM token-embedding space, so diverse inputs can be combined when generating a response. A fusion module then compresses the resulting multimodal query tokens to keep computation manageable. CREMA is validated on video-3D, video-audio, and video-language reasoning tasks, matching or outperforming other multimodal LLMs while using 96% fewer trainable parameters.
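
To make the projection-then-fusion idea concrete, below is a minimal PyTorch sketch of how per-modality adapters and a query-compression module could be wired in front of a frozen LLM. The class names, dimensions, and the attention-based fusion are illustrative assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch, assuming hypothetical module names and simplified shapes;
# not the paper's actual code.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Parameter-efficient adapter mapping one modality's features
    into the frozen LLM's token-embedding space."""
    def __init__(self, feat_dim: int, llm_dim: int, bottleneck: int = 64):
        super().__init__()
        # Low-rank bottleneck keeps the trainable parameter count small.
        self.down = nn.Linear(feat_dim, bottleneck)
        self.up = nn.Linear(bottleneck, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # (B, N, feat_dim)
        return self.up(torch.relu(self.down(feats)))          # (B, N, llm_dim)

class QueryFusion(nn.Module):
    """Fusion module that compresses concatenated multimodal query tokens
    down to a fixed budget before they reach the LLM."""
    def __init__(self, llm_dim: int, num_out_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_out_queries, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, M, llm_dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)                # (B, num_out, llm_dim)
        return fused

# Usage: project each modality, concatenate, compress, then feed to the frozen LLM.
llm_dim = 768
projectors = {"video": ModalityProjector(1024, llm_dim),
              "audio": ModalityProjector(128, llm_dim),
              "3d":    ModalityProjector(256, llm_dim)}
fusion = QueryFusion(llm_dim)

batch = {"video": torch.randn(2, 32, 1024),
         "audio": torch.randn(2, 16, 128),
         "3d":    torch.randn(2, 32, 256)}
tokens = torch.cat([projectors[m](x) for m, x in batch.items()], dim=1)
llm_inputs = fusion(tokens)  # (2, 32, 768): fixed token budget regardless of modality count
```

Only the small adapters and the fusion module are trained in this setup, which is the design choice behind the large reduction in trainable parameters relative to fine-tuning the full multimodal LLM.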

Publication date: 8 Feb 2024
Project Page: https://CREMA-VideoLLM.github.io/
Paper: https://arxiv.org/pdf/2402.05889