CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
The paper presents CREMA, a modality-fusion framework for efficient multimodal compositional video reasoning. The framework is designed to integrate any new modality into video reasoning, leveraging…