The article summarizes research on evaluating the quality and variability of text generated by Large Language Models (LLMs). Conventional evaluation methods often fail to capture holistic semantic equivalence, a shortcoming that matters most in high-stakes applications such as healthcare and finance. To address this, the study proposes DCR, an automated framework that uses a divide-conquer-reasoning approach to evaluate and improve the consistency of LLM-generated text. Instead of comparing two paragraphs directly, the approach breaks the comparison into individual sentence-to-paragraph checks, and an automatic metric converter translates the resulting judgments into an interpretable numeric score. Across multiple benchmarks, DCR outperforms other methods at evaluating the consistency of LLM generations.
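
The sketch below illustrates the general divide-conquer-then-score idea under stated assumptions; it is not the paper's actual DCE/AMC implementation. The sentence splitter, the pluggable `judge` callable, and the simple fraction-of-consistent-sentences scoring rule are all illustrative stand-ins (the real framework delegates both the division and the per-sentence reasoning to LLM prompts).

```python
import re
from typing import Callable, List


def split_sentences(paragraph: str) -> List[str]:
    """Naive regex-based sentence splitter (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def consistency_score(
    candidate: str,
    reference: str,
    judge: Callable[[str, str], bool],
) -> float:
    """Divide: break the candidate paragraph into sentences.
    Conquer: ask `judge` whether each sentence is consistent with the reference.
    Convert: map the per-sentence verdicts to a single score in [0, 1]."""
    sentences = split_sentences(candidate)
    if not sentences:
        return 0.0
    verdicts = [judge(sentence, reference) for sentence in sentences]
    return sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    # Trivial stand-in judge; a real setup would prompt an LLM for each check.
    reference = "The invoice was paid on March 3. The total was $120."
    candidate = "The invoice was paid on March 3. The total was $150."
    naive_judge = lambda sentence, ref: sentence in ref  # placeholder only
    print(consistency_score(candidate, reference, naive_judge))  # -> 0.5
```

The per-sentence decomposition is what makes the final score interpretable: each sentence contributes an explicit consistent/inconsistent verdict rather than a single opaque paragraph-level rating.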

Publication date: 4 Jan 2024
Project Page: https://github.com/intuit-ai-research/DCR-consistency
Paper: https://arxiv.org/pdf/2401.02132