The paper discusses how large language models (LLMs) generate text that often mixes factual and non-factual claims, making it challenging to evaluate their factual precision. The researchers propose an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities. They evaluated the D-FActScores of biographies of people generated with retrieval-augmented generation (RAG) and found that the metric assesses the factuality of paragraphs with entity ambiguity better than the existing metrics FActScore and citation recall. The study also reveals that four widely used open-source LLMs tend to mix information about distinct entities, producing non-factual paragraphs.
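
To make the distinction concrete, here is a minimal, hypothetical sketch of the intuition (not the paper's actual algorithm): a plain FActScore-style score counts the fraction of atomic facts supported by *any* entity in a knowledge base, so a biography that stitches together two same-name people can still score perfectly, while a D-FActScore-style score credits only facts consistent with a single disambiguated entity. The toy knowledge base, fact strings, and the max-over-entities grouping below are illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' implementation.
# Toy knowledge base: two distinct people who share the name "John Smith".
KB = {
    "John Smith (chemist)": {"born in 1960", "won the Royal Medal"},
    "John Smith (novelist)": {"born in 1985", "wrote three novels"},
}

# Atomic facts extracted from a generated biography that mixes the two people.
facts = ["born in 1960", "won the Royal Medal", "wrote three novels"]

def factscore(facts, kb):
    """Fraction of facts supported by *any* entity in the knowledge base."""
    supported = sum(any(f in doc for doc in kb.values()) for f in facts)
    return supported / len(facts)

def d_factscore(facts, kb):
    """Fraction of facts supported by the single best-matching entity,
    so facts borrowed from a different same-name entity count as errors."""
    best = max(sum(f in doc for f in facts) for doc in kb.values())
    return best / len(facts)

print(f"FActScore:   {factscore(facts, KB):.2f}")   # 1.00 -- entity mixing goes unnoticed
print(f"D-FActScore: {d_factscore(facts, KB):.2f}") # 0.67 -- entity mixing is penalized
```

Each fact is individually true of *some* John Smith, which is exactly why per-fact support checks alone cannot flag the mixed paragraph as non-factual.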
Publication date: 9 Feb 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2402.05629