This paper provides a comprehensive survey of prompt engineering for three types of vision-language models: multimodal-to-text generation models, image-text matching models, and text-to-image generation models. Prompt engineering, the technique of augmenting a pre-trained model with task-specific hints or prompts, has been well studied in natural language processing and has more recently been investigated in vision-language modeling. Yet systematic overviews of prompt engineering for pre-trained vision-language models are still lacking, a gap this survey aims to fill.

Publication date: July 24, 2023
Project Page: https://github.com/JindongGu/Awesome-Prompting-on-Vision-Language-Model/
Paper: https://arxiv.org/pdf/2307.12980.pdf