The paper presents MosaicFusion, a diffusion-based data augmentation approach for large vocabulary instance segmentation. The method is training-free and does not rely on any label supervision: it uses an off-the-shelf text-to-image diffusion model as a dataset generator that produces both object instances and their mask annotations. It divides an image canvas into several regions and runs a single round of the diffusion process to generate multiple instances simultaneously, with each region conditioned on a different text prompt. It then obtains the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. MosaicFusion can improve the performance of existing instance segmentation models, especially for rare and novel categories.
Publication date: 22 Sep 2023
Project Page: https://github.com/Jiahao000/MosaicFusion
Paper: https://arxiv.org/pdf/2309.13042
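
The two mechanical steps in the summary, dividing the canvas into regions and turning aggregated cross-attention maps into a binary mask, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`split_canvas`, `upsample_nearest`, `aggregate_mask`), the even grid split, the nearest-neighbour upsampling, and the fixed 0.5 threshold are all assumptions made for illustration. The edge-aware refinement step is omitted, and plain Python lists stand in for the tensors a real pipeline would use.

```python
from typing import List, Tuple

Map = List[List[float]]  # square 2D attention map, row-major


def split_canvas(height: int, width: int, rows: int, cols: int) -> List[Tuple[int, int, int, int]]:
    """Divide an H x W canvas into a rows x cols grid.

    Returns (top, left, bottom, right) boxes. An even grid is an assumption
    of this sketch; each box would be paired with its own text prompt.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            boxes.append((top, left, bottom, right))
    return boxes


def upsample_nearest(m: Map, size: int) -> Map:
    """Nearest-neighbour resize of a square map to size x size."""
    src = len(m)
    return [[m[y * src // size][x * src // size] for x in range(size)]
            for y in range(size)]


def aggregate_mask(maps: List[Map], size: int, threshold: float = 0.5) -> List[List[int]]:
    """Average attention maps collected from different layers and time steps,
    min-max normalise the result, and binarise it with a fixed threshold
    (the threshold value is a hypothetical choice)."""
    acc = [[0.0] * size for _ in range(size)]
    for m in maps:
        up = upsample_nearest(m, size)  # bring every map to a common resolution
        for y in range(size):
            for x in range(size):
                acc[y][x] += up[y][x] / len(maps)
    lo = min(min(row) for row in acc)
    hi = max(max(row) for row in acc)
    scale = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[1 if (v - lo) / scale > threshold else 0 for v in row] for row in acc]


if __name__ == "__main__":
    # Two toy maps for one object token: a coarse 2x2 map and a finer 4x4 map.
    coarse = [[1.0, 0.0], [0.0, 0.0]]
    fine = [[1.0, 1.0, 0.0, 0.0],
            [1.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0]]
    print(split_canvas(4, 4, 2, 2))
    print(aggregate_mask([coarse, fine], size=4))
```

In this toy case both maps attend to the top-left quadrant, so the aggregated mask marks only that quadrant. Averaging over several layers and time steps is what makes the mask less sensitive to noise in any single attention map.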