The paper focuses on aligning language models with human preferences for real-world applications. It discusses the drawbacks of reinforcement learning (RL) and direct preference optimization (DPO) for this goal: RL is complicated by the high variance of its policy updates, while DPO, despite its simplicity, is not guaranteed to converge to the optimal policy. To address these issues, the paper proposes efficient exact optimization (EXO) of the alignment objective. The authors show that EXO optimizes in the same direction as RL algorithms asymptotically while avoiding their complexities. The paper concludes by comparing EXO with DPO and demonstrating its advantages on realistic human preference data.
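For context, the KL-regularized alignment objective that RL-based alignment, DPO, and (per the summary above) EXO all target is the standard formulation sketched below; here $r$ is the reward model, $\beta$ the regularization coefficient, and $\pi_{\mathrm{ref}}$ the reference policy. This is the widely used formulation stated for orientation, not a quotation from the paper.

```latex
% KL-regularized alignment objective (standard formulation, stated for context)
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\;
  \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]

% Its well-known closed-form optimum, which exact-optimization approaches build on:
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)
```

RL methods approach this optimum through sampled policy-gradient updates, which is where the high variance arises; methods like DPO and EXO instead work directly from the closed-form characterization of $\pi^{*}$.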
Publication date: 2 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.00856