This article presents Multimodal Pathway, a method for improving a transformer trained on one modality by using irrelevant (that is, unpaired and unrelated) data from other modalities. For instance, an ImageNet model can be improved with audio or point cloud datasets. The method takes an auxiliary transformer trained on data of another modality and constructs pathways that connect the two models, so that data of the target modality is processed by both. What distinguishes it from approaches that depend on paired data is that the datasets from the different modalities are unpaired. The researchers report significant and consistent performance improvements on image, point cloud, video, and audio recognition tasks.
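
To make the idea of a pathway concrete, here is a minimal sketch in plain Python. The summary above does not say how the pathways are built, so the sketch assumes one simple form: a linear layer of the target model is paired with the corresponding layer of the auxiliary model, and their outputs are added with a scaling factor. The names `linear`, `pathway_linear`, `merge_weights`, and `scale` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only. The pathway form (scaled sum of two linear layers)
# and all names below are assumptions, not taken from the paper's code.

def linear(x, w):
    """Multiply a row vector x (length n) by a weight matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def pathway_linear(x, w_target, w_aux, scale):
    """Process x with both the target and auxiliary layers, then add the
    auxiliary output scaled by `scale` to the target output."""
    y_target = linear(x, w_target)
    y_aux = linear(x, w_aux)
    return [t + scale * a for t, a in zip(y_target, y_aux)]

def merge_weights(w_target, w_aux, scale):
    """Fold the auxiliary weights into the target weights. This is valid
    because both layers are linear, so x(W_t + s*W_a) = xW_t + s*xW_a."""
    return [[wt + scale * wa for wt, wa in zip(row_t, row_a)]
            for row_t, row_a in zip(w_target, w_aux)]

x = [1.0, 2.0]                          # a toy target-modality token
w_target = [[1.0, 0.0], [0.0, 1.0]]     # toy weights of the target model
w_aux = [[0.5, -1.0], [2.0, 0.0]]       # toy weights of the auxiliary model
scale = 0.5                             # hypothetical pathway strength

two_branch = pathway_linear(x, w_target, w_aux, scale)
merged = linear(x, merge_weights(w_target, w_aux, scale))
print(two_branch, merged)  # both are [3.25, 1.5]
```

Under this assumption, the two-branch computation and the merged-weight computation give the same result, so a pathway of this form would add no extra cost at inference time once the weights are folded together.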

Publication date: 26 Jan 2024
Project Page: https://ailab-cvc.github.io/M2PT/
Paper: https://arxiv.org/pdf/2401.14405