The article presents CLAIR, a novel method for evaluating machine-generated image captions that leverages large language models (LLMs) to assess caption quality. The study found that CLAIR correlates more strongly with human judgments than existing measures: on Flickr8K-Expert, it improves relative correlation by 39.6% over SPICE and by 18.3% over image-augmented methods such as RefCLIP-S. CLAIR also produces noisily interpretable results, allowing the language model to explain the reasoning behind its score. This advancement opens new avenues for the automatic evaluation of image captions.
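
To make the approach concrete, below is a minimal sketch of a CLAIR-style scorer in Python. It assumes an OpenAI-compatible chat API; the prompt wording, model name, and the `clair_score` helper are illustrative paraphrases of the idea described in the paper, not the authors' exact implementation.

```python
import json
from openai import OpenAI  # assumes the `openai` Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt in the spirit of CLAIR: ask the LLM to judge, on a
# 0-100 scale, how likely the candidate caption describes the same image
# as the reference captions, and to justify its score in JSON.
_PROMPT = """\
You are trying to tell if a candidate caption describes the same image
as a set of reference captions.

Candidate caption:
{candidate}

Reference captions:
{references}

On a scale from 0 to 100, how likely is it that the candidate caption
describes the same image as the reference captions? Respond in JSON with
two keys: "score" (an integer from 0 to 100) and "reason" (a short
explanation of the score).
"""

def clair_score(candidate: str, references: list[str],
                model: str = "gpt-3.5-turbo") -> dict:
    """Return {'score': int, 'reason': str} for one candidate caption."""
    prompt = _PROMPT.format(
        candidate=candidate,
        references="\n".join(f"- {r}" for r in references),
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring for evaluation
    )
    # Assumes the model follows the JSON instruction; a robust
    # implementation would validate or retry on malformed output.
    return json.loads(resp.choices[0].message.content)

if __name__ == "__main__":
    result = clair_score(
        "a dog runs across a grassy field",
        ["a brown dog running through the grass",
         "a dog playing outside on the lawn"],
    )
    print(result["score"], "-", result["reason"])
```

Because the score and rationale come back in a single response, one call yields both the metric and its "noisily interpretable" explanation.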

Publication date: 19 Oct 2023
Project Page: https://davidmchan.github.io/clair/
Paper: https://arxiv.org/pdf/2310.12971