The paper focuses on the Contrastive Language-Image Pre-training (CLIP) technique used in computer vision. The authors argue that CLIP's success lies in its data, not its architecture or pre-training objective; however, the original CLIP work discloses little about that data or how it was collected. The authors aim to reveal CLIP's data curation approach and introduce MetaCLIP, a method that takes a raw data pool and metadata (derived from CLIP's concepts) and yields a subset that is balanced over the metadata distribution. In experiments, MetaCLIP applied to CommonCrawl with 400M image-text pairs outperforms CLIP's data on multiple standard benchmarks, including higher zero-shot ImageNet classification accuracy. The curation code and the training data's distribution over the metadata are made available to the community.
Publication date: 28 Sep 2023
Project Page: https://github.com/facebookresearch/MetaCLIP
Paper: https://arxiv.org/pdf/2309.16671
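To make the balancing idea concrete, below is a minimal sketch of metadata-based curation in the spirit described above: captions are substring-matched against metadata entries, entries with few matches are kept in full, and over-represented entries are down-sampled toward a per-entry cap. The function names (`match_entries`, `balanced_curation`), the cap value `t`, and the toy data are illustrative assumptions, not the authors' released code (see the project page for that).

```python
import random

def match_entries(text, metadata):
    """Return indices of metadata entries that appear as substrings of the caption."""
    lowered = text.lower()
    return [i for i, entry in enumerate(metadata) if entry in lowered]

def balanced_curation(pairs, metadata, t=20):
    """Illustrative balanced sub-sampling over metadata entries.

    pairs:    list of (image_id, caption) tuples from the raw pool
    metadata: list of lowercase query strings (e.g., concept/term lists)
    t:        per-entry cap; head entries are down-sampled toward this count
    """
    # Count how many pairs match each metadata entry.
    entry_count = [0] * len(metadata)
    matches = []
    for _, caption in pairs:
        ids = match_entries(caption, metadata)
        matches.append(ids)
        for i in ids:
            entry_count[i] += 1

    # Tail entries (count <= t) are kept in full; head entries get probability t / count.
    entry_prob = [1.0 if c <= t else t / c for c in entry_count]

    curated = []
    for pair, ids in zip(pairs, matches):
        if not ids:
            continue  # pairs matching no metadata entry are dropped
        # Keep a pair with the probability of its most tail-like matched entry.
        if random.random() < max(entry_prob[i] for i in ids):
            curated.append(pair)
    return curated

if __name__ == "__main__":
    metadata = ["dog", "cat", "photo"]
    pairs = [(k, f"a photo of a dog number {k}") for k in range(100)] + \
            [(100 + k, f"a cat sitting, image {k}") for k in range(5)]
    subset = balanced_curation(pairs, metadata, t=10)
    print(len(subset), "pairs kept out of", len(pairs))
```

Running the toy example keeps roughly all of the rare "cat" captions while thinning the abundant "dog"/"photo" captions, which is the flattening effect over the metadata distribution that the paper attributes CLIP's data quality to.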