This work presents the Multimodal Interlaced Transformer (MIT) for weakly supervised point cloud segmentation that jointly leverages 2D and 3D data. Existing methods achieve 2D-3D information fusion only with extra 2D annotations, which raises the annotation cost. To address this, the authors propose a transformer with two encoders and one decoder that is trained using only scene-level class tags. The encoders compute self-attended features for the 3D point cloud and the 2D multi-view images, while the decoder performs interlaced 2D-3D cross-attention and 2D-3D feature fusion. Experiments show that MIT performs favorably against existing weakly supervised point cloud segmentation methods.
Publication date: 20 Oct 2023
Project Page: https://jimmy15923.github.io/mit_web/
Paper: https://arxiv.org/pdf/2310.12817
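The two-encoder, one-decoder layout described above can be illustrated with a minimal PyTorch sketch. All dimensions, layer counts, the concatenation-based fusion, and the max-pooled scene-level head below are illustrative assumptions rather than the paper's exact design; names such as `MITSketch` and `InterlacedDecoderLayer` are hypothetical.

```python
# Minimal sketch of a two-encoder / one-decoder interlaced transformer.
# Dimensions, depths, and the fusion scheme are assumed for illustration.
import torch
import torch.nn as nn


class InterlacedDecoderLayer(nn.Module):
    """One decoder layer: 3D tokens attend to 2D tokens, then 2D tokens
    attend to the updated 3D tokens (the "interlaced" cross-attention)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_3d_to_2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_2d_to_3d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3d = nn.LayerNorm(dim)
        self.norm2d = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # concatenation-based 2D-3D fusion (assumed)

    def forward(self, feat3d, feat2d):
        # 3D queries gather appearance cues from the 2D views.
        upd3d, _ = self.cross_3d_to_2d(feat3d, feat2d, feat2d)
        feat3d = self.norm3d(feat3d + upd3d)
        # 2D queries gather geometry cues from the updated 3D points.
        upd2d, _ = self.cross_2d_to_3d(feat2d, feat3d, feat3d)
        feat2d = self.norm2d(feat2d + upd2d)
        # Fuse the two modalities back into the 3D (point) stream.
        pooled2d = feat2d.mean(dim=1, keepdim=True).expand_as(feat3d)
        fused3d = self.fuse(torch.cat([feat3d, pooled2d], dim=-1))
        return fused3d, feat2d


class MITSketch(nn.Module):
    def __init__(self, dim=256, num_classes=20, depth=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.enc3d = nn.TransformerEncoder(enc_layer, num_layers=depth)  # point-cloud encoder
        self.enc2d = nn.TransformerEncoder(enc_layer, num_layers=depth)  # multi-view image encoder
        self.decoder = nn.ModuleList(InterlacedDecoderLayer(dim) for _ in range(depth))
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, point_tokens, image_tokens):
        feat3d = self.enc3d(point_tokens)   # (B, N_points, dim)
        feat2d = self.enc2d(image_tokens)   # (B, N_views * N_patches, dim)
        for layer in self.decoder:
            feat3d, feat2d = layer(feat3d, feat2d)
        point_logits = self.cls_head(feat3d)          # per-point class scores
        # Scene-level prediction for the weak (class-tag) supervision: pool over points.
        scene_logits = point_logits.max(dim=1).values
        return point_logits, scene_logits


# Hypothetical usage with pre-embedded tokens (backbone feature extraction omitted).
model = MITSketch(dim=256, num_classes=20)
point_tokens = torch.randn(2, 1024, 256)     # B=2 scenes, 1024 point tokens each
image_tokens = torch.randn(2, 5 * 196, 256)  # 5 views x 196 patch tokens per scene
point_logits, scene_logits = model(point_tokens, image_tokens)
```

Under the scene-level supervision described above, `scene_logits` would plausibly be trained with a multi-label classification loss while `point_logits` serve as the segmentation output at inference; this training recipe is an assumption for the sketch, not a detail taken from the paper.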