This paper addresses the issue of societal bias in Large Language Models (LLMs), specifically the Llama 2 7B Chat model. It uses activation steering to probe and mitigate biases related to gender, race, and religion. The findings reveal inherent gender bias in the model that persists even after Reinforcement Learning from Human Feedback (RLHF). The study also finds that RLHF tends to increase the similarity of the model's representations of different forms of societal bias. The work provides insights into effective red-teaming strategies for LLMs using activation steering.
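
The paper's core technique is activation steering. As a rough illustration of the general idea (not the paper's exact setup), the sketch below builds a steering vector from the activation difference of a contrastive prompt pair and adds it to one decoder layer's residual stream during generation, using the HuggingFace transformers API. The layer index, coefficient, checkpoint name, and prompts are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint name
LAYER = 13    # decoder layer whose residual stream is steered (illustrative)
COEFF = 4.0   # steering strength (illustrative)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def mean_residual(prompt: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation of a prompt at the output of `layer`."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer i's output is index i + 1.
    return out.hidden_states[layer + 1][0].mean(dim=0)

# Contrastive prompt pair (hypothetical) whose activation difference defines a
# gender-associated steering direction.
steering_vector = mean_residual("The nurse said that she", LAYER) \
    - mean_residual("The nurse said that he", LAYER)

def steering_hook(module, inputs, output):
    """Add the scaled steering vector to the hooked layer's hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steering_vector.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Steer generation by hooking the chosen decoder layer, then remove the hook.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "The doctor walked into the room and"
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

In this style of probing, sweeping the coefficient (including negative values) and comparing the resulting completions is the usual way to test how strongly a bias direction is encoded and whether steering can suppress it; the exact vectors and evaluation protocol used in the paper are described in the linked arXiv version.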

Publication date: 1 Feb 2024
Project Page: https://arxiv.org/abs/2402.00402v1
Paper: https://arxiv.org/pdf/2402.00402