The article introduces TinyCLIP, a novel cross-modal distillation method for large-scale language-image pre-trained models such as CLIP. The method rests on two core techniques: affinity mimicking and weight inheritance. Affinity mimicking lets student models imitate the teacher's cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transfers pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50% while maintaining comparable zero-shot performance. The TinyCLIP ViT-8M/16 model, trained on YFCC-15M, surpasses the original CLIP ViT-B/16 in zero-shot accuracy by 3.5% while using only 8.9% of its parameters.
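Conceptually, affinity mimicking can be sketched as matching the student's batch image-text similarity distribution to the teacher's. The NumPy sketch below is illustrative only: the function names, the temperature value, and the cross-entropy form of the loss are assumptions for exposition, not the paper's exact formulation (which also includes the symmetric text-to-image direction).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affinity_mimicking_loss(t_img, t_txt, s_img, s_txt, tau=0.07):
    """Cross-entropy between teacher and student image-to-text
    affinity distributions over a batch (illustrative sketch)."""
    # L2-normalize embeddings, as in CLIP-style contrastive training
    norm = lambda z: z / np.linalg.norm(z, axis=1, keepdims=True)
    t_img, t_txt, s_img, s_txt = map(norm, (t_img, t_txt, s_img, s_txt))
    # batch affinity matrices: row i holds image i's similarity to every text
    p_teacher = softmax(t_img @ t_txt.T / tau)
    p_student = softmax(s_img @ s_txt.T / tau)
    # student distribution penalized where it diverges from the teacher's
    return float(-(p_teacher * np.log(p_student + 1e-12)).sum(axis=1).mean())
```

By Gibbs' inequality this loss is minimized when the student's affinity matrix matches the teacher's, which is the intuition behind distilling in the affinity space rather than matching raw features.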
Publication date: 22 Sep 2023
Project Page: aka.ms/tinyclip
Paper: https://arxiv.org/pdf/2309.12314