The paper tackles the challenges of training large language models (LLMs) with Reinforcement Learning from Human Feedback (RLHF). The authors identify three properties of RLHF tasks, namely fast simulation, deterministic transitions, and trajectory-level rewards, that the currently dominant algorithm, PPO, does not exploit. To leverage them, they propose ReMax, a new algorithm that is simpler, more computationally efficient, and lighter on memory than PPO, without sacrificing performance; the memory savings come largely from dispensing with PPO's separate value model. The authors argue these benefits should carry over to larger-scale models.
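According to the paper, ReMax replaces PPO's learned value model with a simple baseline: the reward of the greedy response to the same prompt is subtracted from the reward of the sampled response, giving a REINFORCE-style estimator. Below is a minimal sketch of that update on a toy policy; the table-based policy and keyword reward are illustrative stand-ins, not the authors' implementation (see the repo linked below for that).

```python
# Sketch of the ReMax gradient estimator: REINFORCE with the reward of
# the greedy response as a variance-reducing baseline. No value network
# is trained, unlike PPO.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, SEQ_LEN = 8, 5

# Toy "policy": one learnable logit row per position (an LLM in practice).
logits = torch.zeros(SEQ_LEN, VOCAB, requires_grad=True)

def reward(tokens):
    # Stand-in trajectory-level reward; RLHF would query a reward model.
    return (tokens == 3).float().sum()

def rollout(greedy=False):
    probs = F.softmax(logits, dim=-1)
    if greedy:
        tokens = probs.argmax(dim=-1)                     # deterministic response
    else:
        tokens = torch.multinomial(probs, 1).squeeze(-1)  # sampled response
    log_prob = torch.distributions.Categorical(probs=probs).log_prob(tokens).sum()
    return tokens, log_prob

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(200):
    y, log_p = rollout()                 # stochastic generation
    with torch.no_grad():
        y_bar, _ = rollout(greedy=True)  # greedy baseline generation
    # ReMax update direction: (r(y) - r(y_bar)) * grad log pi(y | x).
    advantage = reward(y) - reward(y_bar)
    loss = -advantage * log_p
    opt.zero_grad()
    loss.backward()
    opt.step()

print("learned greedy response:", rollout(greedy=True)[0].tolist())
```

Because the greedy rollout needs no gradient, the only trained component is the policy itself, which is where the memory savings over PPO come from.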

Publication date: 16 Oct 2023
Project Page: https://github.com/liziniu/ReMax
Paper: https://arxiv.org/pdf/2310.10505