The article focuses on understanding the mechanisms of alignment algorithms, particularly Direct Preference Optimization (DPO), and how they reduce toxicity in language models such as GPT-2. The researchers first study how toxicity is represented and elicited in GPT-2. DPO is then applied with a carefully constructed pairwise dataset to reduce toxicity. The study finds that the toxic capabilities learned during pre-training are not removed but merely bypassed, and this insight is used to demonstrate a simple method that reverts the model to its original toxic behavior.
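As a rough illustration of the pairwise preference objective referred to above, the sketch below computes the standard DPO loss from sequence-level log-probabilities of a preferred (non-toxic) and a dispreferred (toxic) continuation. The function name, tensor shapes, and the beta value are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on summed sequence log-probabilities.

    Each argument is a 1-D tensor of shape (batch,) holding the total
    log-probability that the trainable policy or the frozen reference
    model assigns to the preferred (non-toxic) or dispreferred (toxic)
    continuation. beta=0.1 is an assumed hyperparameter.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps))
```

Because the reference model only enters through the log-ratio, the policy is pushed to down-weight toxic continuations relative to non-toxic ones without being retrained from scratch, which is consistent with the paper's finding that pre-trained capabilities are bypassed rather than erased.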

Publication date: 5 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.01967