CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
This paper introduces CREMA, a new and efficient modality-fusion framework designed to improve video reasoning. By leveraging existing pre-trained models, it incorporates multiple informative modalities from videos, such as optical…