CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
This paper introduces CREMA, a new and efficient modality-fusion framework designed to improve video reasoning. By leveraging…
The Segment Anything Model (SAM) is a widely used tool for image processing, but its application in…
The article introduces SPHINX-X, a series of Multi-modality Large Language Models (MLLMs) developed based on SPHINX. This…
The paper introduces a new category of diffusion models built on state space architecture for image data…
The article discusses the challenge of 6D object pose estimation and the improved accuracy achieved by incorporating…
The paper presents a new dataset called DAPlankton for developing and benchmarking domain adaptation methods for image…
The article presents a novel training strategy for deep denoisers in signal and image processing. The strategy…
The article presents a new method for estimating robot pose from RGB images, even when robot internal…
This research presents an ordinal regression framework for assessing disease severity in chest radiographs using deep learning…
This article introduces DiffSpeaker, a new model for speech-driven 3D facial animation. Traditional models use either Diffusion…