Despite declining to respond to controversial prompts, large language models (LLMs) may still encode sociodemographic biases in their latent representations. This study proposes a logistic Bradley-Terry probe that detects these biases by predicting word-pair preferences from the words' hidden vectors. In initial validation, the probe outperformed the standard Word Embedding Association Test (WEAT) by 27%. When applied to controversial tasks, it surfaced substantial biases for all target classes; for instance, the Mistral model implicitly preferred Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics.
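The probe can be understood as a linear scorer over hidden states combined with a Bradley-Terry preference model: each word's hidden vector is mapped to a scalar score, and the probability that the LLM prefers word *a* over word *b* is the logistic sigmoid of the score difference. Below is a minimal PyTorch sketch of this idea; the class and function names here are hypothetical, and the paper's actual implementation is in the linked biasprobe repository.

```python
import torch


class BradleyTerryProbe(torch.nn.Module):
    """Linear probe over hidden vectors with a logistic Bradley-Terry head:
    P(a preferred over b) = sigmoid(score(a) - score(b))."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # One scalar score per hidden vector; no bias, since only
        # score differences matter in the Bradley-Terry model.
        self.scorer = torch.nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
        s_a = self.scorer(h_a).squeeze(-1)
        s_b = self.scorer(h_b).squeeze(-1)
        return torch.sigmoid(s_a - s_b)


def train_step(probe, optimizer, h_a, h_b, y):
    """One gradient step. h_a, h_b are hidden vectors extracted from the
    LLM for each word pair; y is a float tensor, 1.0 if word a was preferred."""
    optimizer.zero_grad()
    p = probe(h_a, h_b)
    loss = torch.nn.functional.binary_cross_entropy(p, y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained on pairs with known preference labels, the learned scorer can be applied to hidden vectors of new target words (e.g., country or religion names), and the induced score ranking exposes the model's implicit preference ordering.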

Publication date: 30 Nov 2023
Project Page: https://github.com/castorini/biasprobe
Paper: https://arxiv.org/pdf/2311.18812