The paper introduces SQuArE (Sentence-level QUestion AnsweRing Evaluation), a new metric for evaluating Question Answering (QA) systems. Existing evaluation approaches, such as human annotation, are expensive and challenging to carry out. Recent work has shown that similarity metrics based on transformer LM encoders transfer well to QA evaluation, but they are limited by their reliance on a single correct reference answer. SQuArE addresses this by comparing a candidate answer against multiple reference answers, both correct and incorrect, which improves the accuracy of its predictions. Evaluated on a range of QA systems and datasets, SQuArE outperforms previous methods.
Publication date: 21 Sep 2023
Project Page: https://arxiv.org/abs/2309.12250v1
Paper: https://arxiv.org/pdf/2309.12250
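
To make the multi-reference idea in the summary concrete, here is a minimal Python sketch of why incorrect references can help. It is not SQuArE's actual model: the summary describes transformer-encoder metrics, whereas this sketch swaps in a toy Jaccard token overlap as the similarity function. The function names (`single_reference_score`, `multi_reference_score`), the "best correct match minus best incorrect match" scoring rule, and the example sentences are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set (toy stand-in for a learned encoder)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; an assumption, not SQuArE's scorer."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def single_reference_score(candidate, reference):
    """Baseline: compare the candidate with one correct reference only."""
    return jaccard(candidate, reference)


def multi_reference_score(candidate, correct_refs, incorrect_refs):
    """Hypothetical rule: best match among correct references
    minus best match among incorrect references."""
    best_pos = max((jaccard(candidate, r) for r in correct_refs), default=0.0)
    best_neg = max((jaccard(candidate, r) for r in incorrect_refs), default=0.0)
    return best_pos - best_neg


# Illustrative data, invented for this sketch.
correct_refs = [
    "Apollo 11 landed on the Moon in 1969",
    "The landing took place in July 1969",
]
incorrect_refs = ["Apollo 11 landed on the Moon in 1972"]

good_candidate = "It landed in 1969"
bad_candidate = "Apollo 11 landed on the Moon in 1972"

# With a single correct reference, the wrong answer looks better
# because it shares almost all of its wording with the reference.
single_good = single_reference_score(good_candidate, correct_refs[0])
single_bad = single_reference_score(bad_candidate, correct_refs[0])

# With correct and incorrect references, the wrong answer is penalized
# for matching a known-incorrect reference.
multi_good = multi_reference_score(good_candidate, correct_refs, incorrect_refs)
multi_bad = multi_reference_score(bad_candidate, correct_refs, incorrect_refs)

print(f"single reference: good={single_good:.3f} bad={single_bad:.3f}")
print(f"multi reference:  good={multi_good:.3f} bad={multi_bad:.3f}")
```

In this toy setup the single-reference score ranks the wrong answer above the right one, while the multi-reference score reverses the order. A learned metric would replace `jaccard` and the subtraction rule with a trained model; the sketch only illustrates the role the extra references play.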