The paper introduces ALLURE, a protocol for systematically auditing and improving the accuracy of text evaluation by Large Language Models (LLMs) through iterative in-context learning (ICL). ALLURE audits an evaluator LLM's judgments of responses generated by eight LLMs, identifies the evaluator's failure modes, and improves its evaluation accuracy. The approach stores examples of evaluation failures in memory, uses them to construct ICL prompts, and iteratively refines those prompts to optimize the LLM's evaluation of text. The authors anticipate that ALLURE will make LLM-based evaluation more robust across applications involving the assessment of textual data.
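To make the loop concrete, below is a minimal Python sketch of an ALLURE-style audit-and-improve cycle, written from the description above rather than from the authors' code. Every name in it (`build_icl_prompt`, `audit_and_improve`, the `llm` callable, the exact-match failure check) is an illustrative assumption, not the paper's actual interface.

```python
# Minimal sketch of an ALLURE-style audit loop (illustrative, not the authors' code).
# "llm" is any callable that maps a prompt string to a response string.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text to evaluate, gold human annotation)

def build_icl_prompt(failures: List[Example], text: str) -> str:
    """Prepend stored failure cases as in-context demonstrations."""
    demos = "\n\n".join(
        f"Text: {t}\nCorrect evaluation: {gold}" for t, gold in failures
    )
    return f"{demos}\n\nText: {text}\nEvaluation:"

def audit_and_improve(
    llm: Callable[[str], str],
    annotated: List[Example],
    n_iterations: int = 3,
) -> List[Example]:
    """Iteratively collect evaluator failures and feed them back as ICL demos."""
    memory: List[Example] = []  # failure memory, grown across iterations
    for _ in range(n_iterations):
        new_failures = [
            (text, gold)
            for text, gold in annotated
            # A "failure" is simplified here to exact mismatch with the annotation.
            if llm(build_icl_prompt(memory, text)).strip() != gold
            and (text, gold) not in memory
        ]
        if not new_failures:  # evaluator now agrees with all annotations
            break
        memory.extend(new_failures)
    return memory

# Toy demo with a stub evaluator that always answers "accurate":
if __name__ == "__main__":
    stub = lambda prompt: "accurate"
    print(audit_and_improve(stub, [("The sky is green.", "inaccurate")]))
```

In practice, agreement with the annotated data would be measured with a task-appropriate metric rather than exact string match, and the failure memory would need to be capped or summarized to fit the evaluator's context window.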

Publication date: 24 Sep 2023
Project Page: https://arxiv.org/abs/2309.13701
Paper: https://arxiv.org/pdf/2309.13701