The article presents GEST, a new dataset for measuring gender-stereotypical reasoning in masked language models and English-to-X machine translation systems. GEST contains samples covering 16 gender stereotypes, compatible with English and 9 Slavic languages. The study used GEST to evaluate 11 masked language models and 4 machine translation systems and found significant amounts of stereotypical reasoning in almost all evaluated models and languages. The dataset is intended to help identify and mitigate problematic system behaviors related to gender biases and stereotypes.
Publication date: 30 Nov 2023
Project Page: https://github.com/kinit-sk/gestar
Paper: https://arxiv.org/pdf/2311.18711