A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
The article focuses on the mechanisms by which alignment algorithms, particularly Direct Preference Optimization (DPO), reduce toxicity in language models such as GPT-2. The researchers first study how…
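For reference, DPO fine-tunes a policy directly on preference pairs, without training a separate reward model. A standard statement of its objective (following Rafailov et al., 2023; the notation below is illustrative and not taken from the article) is:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
\]

Here \(y_w\) is the preferred completion (e.g., non-toxic), \(y_l\) the dispreferred (toxic) one, \(\pi_{\mathrm{ref}}\) a frozen reference model, \(\sigma\) the logistic function, and \(\beta\) a hyperparameter controlling how far the policy may drift from the reference.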