This study investigates whether large language models (LLMs) encode sociodemographic biases even when they refuse to respond to sensitive prompts. The researchers probe contextualized embeddings to test whether such biases persist in the models' latent representations. They propose a logistic Bradley-Terry probe that predicts an LLM's word-pair preferences from the words' hidden vectors. Validated on three pair-preference tasks and thirteen LLMs, the probe outperforms the standard approach for testing implicit associations. The findings suggest that instruction fine-tuning does not necessarily debias contextualized embeddings.
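
To make the idea concrete, here is a minimal sketch of what a logistic Bradley-Terry probe over hidden vectors could look like. This is an illustrative assumption, not the authors' implementation (see the project page for their code); the class and variable names below are hypothetical. The probe assigns each word a scalar score via a linear map of its hidden vector and models P(word a preferred over word b) as the sigmoid of the score difference.

```python
# Sketch of a logistic Bradley-Terry probe on contextualized embeddings.
# Hypothetical names; not the biasprobe repository's actual API.
import torch
import torch.nn as nn

class BradleyTerryProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Linear scorer w^T h (no bias term: it cancels in the difference).
        self.scorer = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
        # Logit of P(word a preferred over word b).
        return (self.scorer(h_a) - self.scorer(h_b)).squeeze(-1)

# Toy training loop on synthetic preference data.
hidden_dim = 768
probe = BradleyTerryProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# h_a, h_b: hidden vectors of the two words in each pair;
# labels: 1.0 if the LLM preferred word a, else 0.0.
h_a = torch.randn(256, hidden_dim)
h_b = torch.randn(256, hidden_dim)
labels = torch.randint(0, 2, (256,)).float()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(h_a, h_b), labels)
    loss.backward()
    optimizer.step()
```

Training the probe amounts to logistic regression on the difference of the two projected embeddings, so a high held-out accuracy indicates that the preference signal is linearly decodable from the latent representations.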


Publication date: 30 Nov 2023
Project Page: https://github.com/castorini/biasprobe
Paper: https://arxiv.org/pdf/2311.18812