This paper examines GPT-4V as an automated evaluator for vision-language tasks. It validates GPT-4V’s evaluation capabilities on tasks ranging from basic image-to-text and text-to-image synthesis to more advanced image-to-image translation. The study uses two evaluation methods with GPT-4V: single-answer grading and pairwise comparison. Despite certain limitations, GPT-4V shows promising agreement with human evaluations across the tasks and methods studied, suggesting that multimodal large language models (LLMs) have significant potential as evaluators.
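
As a rough illustration of what "agreement with human evaluations" can mean for the two methods, the sketch below computes a match rate for pairwise comparison and a Pearson correlation for single-answer grading. The function names, the toy data, and the choice of these two statistics are assumptions made for illustration; the summary above does not say which agreement measures the paper reports.

```python
from math import sqrt


def pairwise_agreement(model_choices, human_choices):
    """Fraction of comparisons where the model and the human pick the same winner."""
    if len(model_choices) != len(human_choices) or not model_choices:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(model_choices)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy judgments, not data from the paper.
# Pairwise comparison: which of two candidate outputs is preferred.
model_pairs = ["A", "B", "A", "tie"]
human_pairs = ["A", "B", "B", "tie"]
print(pairwise_agreement(model_pairs, human_pairs))  # 0.75

# Single-answer grading: one numeric score per output.
model_scores = [1, 2, 3, 4]
human_scores = [2, 4, 6, 8]
print(pearson(model_scores, human_scores))
```

A match rate suits the categorical outcome of pairwise comparison, while a correlation suits the numeric scores produced by single-answer grading; other measures could be substituted in either case.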

Publication date: 2 Nov 2023
Project Page: https://arxiv.org/abs/2311.01361v1
Paper: https://arxiv.org/pdf/2311.01361