The study presents CritiqueLLM, a new model designed to evaluate the quality of text generated by large language models (LLMs) such as GPT-4. Because traditional evaluation metrics have shown limited effectiveness, researchers have started to build their own LLM-based evaluation models. CritiqueLLM is trained on high-quality referenced and reference-free evaluation data collected through a dialogue-based prompting method. The experimental results show that the model can match or even surpass GPT-4's performance in system-level correlations with human judgments across a variety of tasks. The authors argue that CritiqueLLM shows promising scaling properties and can provide scalable feedback to improve LLMs.
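For readers unfamiliar with the metric, the sketch below (not taken from the paper; all scores are invented) illustrates what a system-level correlation typically measures: each evaluated system receives one aggregate score from the automatic evaluator and one from human annotators, and the two sets of per-system scores are correlated.

```python
# Illustrative sketch of system-level correlation (hypothetical scores, not paper data).
# Each system (an LLM being evaluated) gets one average score from the automatic
# evaluator (e.g. CritiqueLLM or GPT-4) and one from human annotators over a shared
# test set; we then correlate the two score lists across systems.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system average scores.
evaluator_scores = {"system_a": 7.9, "system_b": 6.4, "system_c": 8.3, "system_d": 5.1}
human_scores     = {"system_a": 8.1, "system_b": 6.0, "system_c": 8.5, "system_d": 5.6}

systems = sorted(evaluator_scores)
auto = [evaluator_scores[s] for s in systems]
human = [human_scores[s] for s in systems]

print("Pearson r:    %.3f" % pearsonr(auto, human)[0])
print("Spearman rho: %.3f" % spearmanr(auto, human)[0])
```

A high correlation means the automatic evaluator ranks systems in roughly the same order as humans do, which is the sense in which CritiqueLLM is compared against GPT-4 as an evaluator.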

Publication date: 1 Dec 2023
Project Page: https://github.com/thu-coai/CritiqueLLM
Paper: https://arxiv.org/pdf/2311.18702