CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
The paper proposes CREMA, an efficient and modular modality-fusion framework for video reasoning. The model improves the flexibility and efficiency of multimodal compositional reasoning by allowing for the…