Vision-language models like CLIP are becoming prevalent due to their strong generalization abilities, yet adapting them to downstream tasks remains challenging. One line of work learns prompts using visual information, but this typically requires labeled image data and the learned prompts often generalize poorly to new classes. An alternative is training-free: class descriptions are generated with large language models (LLMs) and combined through prompt ensembling. This work proposes ProText, a hybrid approach that learns prompts using only text data derived from LLMs. Because no images are needed, the learned prompts can be transferred zero-shot to new classes and datasets, potentially reducing the per-dataset LLM prompt-engineering cost. The study shows that ProText improves over previous prompt-ensembling methods while remaining competitive with approaches that rely on labeled images.
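
To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of text-only prompt learning in the spirit of ProText: shared learnable context vectors are prepended to class-name token embeddings, passed through a frozen CLIP-style text encoder, and optimized so the resulting features match the frozen embeddings of LLM-generated class descriptions. The encoder, tokenized inputs, and loss below are stand-ins for illustration, not the paper's actual implementation (see the project page for that).

```python
# Hypothetical sketch: text-only prompt learning against LLM description targets.
# All module names, shapes, and the loss are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP-style text encoder (not the real CLIP model)."""
    def __init__(self, vocab_size=49408, dim=512, max_len=77):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Parameter(torch.randn(max_len, dim) * 0.01)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_embeddings):
        x = token_embeddings + self.pos_embed[: token_embeddings.size(1)]
        x = self.transformer(x)
        # Pool the last position and return a unit-norm text feature.
        return F.normalize(self.proj(x[:, -1]), dim=-1)


class TextOnlyPromptLearner(nn.Module):
    """Shared learnable context vectors prepended to class-name embeddings."""
    def __init__(self, encoder, n_ctx=4, dim=512):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # the text encoder stays frozen
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable prompts

    def forward(self, class_token_embeds):
        # Prepend the same learned context to every class-name token sequence.
        ctx = self.ctx.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1)
        return self.encoder(torch.cat([ctx, class_token_embeds], dim=1))


# One training step: align prompted class-name features with LLM-description features.
encoder = FrozenTextEncoder()
learner = TextOnlyPromptLearner(encoder)
optimizer = torch.optim.AdamW([learner.ctx], lr=2e-3)

class_tokens = torch.randint(0, 49408, (8, 8))   # dummy tokenized class names
desc_tokens = torch.randint(0, 49408, (8, 32))   # dummy tokenized LLM descriptions
with torch.no_grad():
    desc_features = encoder(encoder.token_embed(desc_tokens))  # frozen targets

prompted_features = learner(encoder.token_embed(class_tokens))
# Mapping objective between prompted and description features
# (the paper's exact loss may differ from this simple MSE).
loss = F.mse_loss(prompted_features, desc_features)
loss.backward()
optimizer.step()
```

Because only the shared context vectors are trained and supervision comes entirely from text, such prompts can, in principle, be reused for unseen classes or datasets without collecting labeled images.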

 

Publication date: 5 Jan 2024
Project Page: https://github.com/muzairkhattak/ProText
Paper: https://arxiv.org/pdf/2401.02418