This paper investigates how trained machine translation (MT) metrics behave when given machine-translated references. The study performs a controlled comparison between a baseline metric that has not been trained on human evaluations, Prism, and a trained version of the same metric, Prism+FT. The findings reveal that Prism+FT is more robust to machine-translated references, a notorious problem in MT evaluation. This suggests that the effects of metric training go beyond the intended goal of improving overall correlation with human judgments. The study also proposes a metric evaluation setup that deliberately uses machine-translated references, showing a clear accuracy gap between the trained and untrained versions of the metric in this setting.
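As a rough illustration of the kind of comparison such an evaluation setup involves, the Python sketch below computes segment-level pairwise accuracy of a reference-based metric under a given set of references, which could be either human or machine-translated. The function names (`pairwise_accuracy`, `score_fn`, `prism_score`) and data variables are hypothetical placeholders for illustration, not the paper's actual code or the Prism API.

```python
# Hypothetical sketch: score MT outputs against a set of references and
# measure how often the metric ranks the human-preferred output higher
# (segment-level pairwise accuracy). Running it once with human references
# and once with machine-translated references shows how much accuracy a
# metric loses when the references themselves are machine-translated.

from typing import Callable, Sequence


def pairwise_accuracy(
    score_fn: Callable[[str, str], float],  # metric(hypothesis, reference) -> score
    better_hyps: Sequence[str],             # outputs judged better by humans
    worse_hyps: Sequence[str],              # outputs judged worse by humans
    references: Sequence[str],              # human or machine-translated references
) -> float:
    """Fraction of segment pairs where the metric prefers the better output."""
    correct = 0
    for better, worse, ref in zip(better_hyps, worse_hyps, references):
        if score_fn(better, ref) > score_fn(worse, ref):
            correct += 1
    return correct / len(better_hyps)


# Usage idea (all inputs hypothetical):
# acc_human = pairwise_accuracy(prism_score, better, worse, human_refs)
# acc_mt    = pairwise_accuracy(prism_score, better, worse, mt_refs)
# A robust metric should lose little accuracy when moving from acc_human to acc_mt.
```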


Publication date: 1 Dec 2023
Project Page: https://arxiv.org/abs/2312.00536
Paper: https://arxiv.org/pdf/2312.00536