This study investigates whether large language models (LLMs) exhibit sociodemographic biases even when they refuse to respond to sensitive prompts. The researchers explored this by probing contextualized embeddings to see whether biases are encoded in the latent representations. They proposed a logistic Bradley-Terry probe that predicts an LLM's word-pair preferences from the words' hidden vectors. The probe was validated on three pair-preference tasks and thirteen LLMs, where it outperformed the standard approach to testing for implicit associations. The findings suggest that instruction fine-tuning does not necessarily debias contextualized embeddings.
Publication date: 30 Nov 2023
Project Page: https://github.com/castorini/biasprobe
Paper: https://arxiv.org/pdf/2311.18812
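
To illustrate the idea, here is a minimal PyTorch-style sketch of a logistic Bradley-Terry probe, not the authors' implementation (see the biasprobe repository for that). It assumes hidden vectors for the two words in each pair have already been extracted from the LLM; the class name, hidden size, and training loop are illustrative placeholders.

```python
import torch
import torch.nn as nn

class BradleyTerryProbe(nn.Module):
    """Linear probe scoring each hidden vector; models
    P(a preferred over b) = sigmoid(score(a) - score(b))."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
        # Preference logit for "a over b" from the two contextualized embeddings.
        return (self.scorer(h_a) - self.scorer(h_b)).squeeze(-1)

# Toy training step on labeled pairs (label = 1 if the LLM preferred word a over word b).
hidden_dim = 4096                      # e.g., a 7B-scale model's hidden size (assumption)
probe = BradleyTerryProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Random stand-ins for hidden vectors extracted from the LLM for each word pair.
h_a = torch.randn(32, hidden_dim)
h_b = torch.randn(32, hidden_dim)
labels = torch.randint(0, 2, (32,)).float()

logits = probe(h_a, h_b)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

Because the probe is logistic and linear in the hidden states, the learned weight vector acts as a preference direction in embedding space, which is what lets the method surface implicit associations even when the model's surface behavior is a refusal.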