This article discusses the application of Reinforcement Learning from Human Feedback (RLHF) in large language models (LLMs). It critically examines the use of Proximal Policy Optimization (PPO), which, though popular, incurs high computational cost and requires sensitive hyperparameter tuning. The authors propose a simpler, less computationally expensive alternative that matches or even improves performance. The study suggests that REINFORCE-style optimization variants outperform both PPO and "RL-free" methods such as DPO and RAFT.
Publication date: 22 Feb 2024
Project Page: https://arxiv.org/pdf/2402.14740.pdf
Paper: https://arxiv.org/pdf/2402.14740
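For intuition, below is a minimal sketch of a REINFORCE-style update with a leave-one-out baseline (RLOO), one of the variants the paper discusses. The function name, tensor shapes, and the k-samples-per-prompt layout are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a REINFORCE-style loss with a leave-one-out baseline (RLOO).
# Shapes and names are assumptions for illustration only.
import torch

def rloo_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE loss with a leave-one-out baseline.

    seq_logprobs: (batch, k) summed log-probabilities of each sampled completion
                  under the current policy.
    rewards:      (batch, k) scalar rewards (e.g. reward-model scores) for the
                  same completions; treated as constants (no gradient).
    """
    rewards = rewards.detach()
    k = rewards.shape[-1]
    # Leave-one-out baseline: mean reward of the other k-1 samples for the
    # same prompt, which reduces variance while keeping the estimator unbiased.
    baseline = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
    advantages = rewards - baseline
    # REINFORCE: maximize E[A * log pi(y|x)], i.e. minimize the negative.
    return -(advantages * seq_logprobs).mean()

# Tiny usage example with random numbers (2 prompts, k=4 samples each).
if __name__ == "__main__":
    logprobs = torch.randn(2, 4, requires_grad=True)
    rewards = torch.randn(2, 4)
    loss = rloo_loss(logprobs, rewards)
    loss.backward()
    print(loss.item(), logprobs.grad.shape)
```

Compared with PPO, this sketch needs no value network, clipping ratio, or GAE machinery: the only extra cost over plain REINFORCE is sampling k completions per prompt to form the baseline.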