The article presents REVEAL, a new dataset for benchmarking automatic verifiers of complex reasoning chains in language model outputs. Complex reasoning tasks often call for step-by-step answers, known as Chains-of-Thought (CoT): language models perform better on such tasks when prompted to generate the reasoning chain behind their answer. However, fine-grained, step-level datasets for evaluating methods that verify these chains have been lacking. REVEAL aims to fill this gap by labeling each reasoning step in a language model's answer for its relevance, its attribution to evidence passages, and its logical correctness.
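To make the label structure concrete, here is a minimal sketch of how a REVEAL-style step-level annotation might be represented in code. The field names, label values, and the `chain_is_correct` helper are illustrative assumptions for this sketch, not the dataset's actual schema.

```python
# Illustrative sketch of a REVEAL-style step-level annotation record.
# Field names and label values are assumptions, not the dataset's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningStep:
    text: str                # the reasoning step as generated by the model
    relevant: bool           # is this step relevant to answering the question?
    attribution: str         # e.g. "supported" / "contradicted" / "no evidence"
    logically_correct: bool  # does it follow from the question and prior steps?

@dataclass
class AnnotatedAnswer:
    question: str
    evidence_passages: List[str]
    steps: List[ReasoningStep] = field(default_factory=list)

def chain_is_correct(answer: AnnotatedAnswer) -> bool:
    """A chain is only as strong as its weakest step: accept the answer
    only if every step is relevant, supported, and logically correct."""
    return all(
        s.relevant and s.attribution == "supported" and s.logically_correct
        for s in answer.steps
    )

example = AnnotatedAnswer(
    question="Which is larger, 17 * 3 or 50?",
    evidence_passages=[],
    steps=[ReasoningStep("17 * 3 = 51, and 51 > 50.", True, "supported", True)],
)
print(chain_is_correct(example))  # True
```

Per-step labels like these let a benchmark score a verifier on each error type separately, rather than only on whole-chain correctness.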

Publication date: 2 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.00559