This article discusses the application of Reinforcement Learning from Human Feedback (RLHF) to large language models (LLMs). It critically examines Proximal Policy Optimization (PPO), which, though popular, carries high computational cost and requires sensitive hyperparameter tuning. The authors propose a simpler, less computationally expensive approach that matches or even improves performance, and the study finds that REINFORCE-style optimization variants outperform both PPO and RL-free methods such as DPO and RAFT.
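To make "REINFORCE-style optimization" concrete, below is a minimal PyTorch sketch of a REINFORCE loss with a leave-one-out baseline over multiple sampled completions per prompt, the kind of estimator the paper refers to as RLOO. This is an illustrative sketch, not the authors' implementation; the function name rloo_loss, the tensor shapes, and the toy rewards are assumptions for demonstration.

```python
# Minimal sketch (not the paper's code) of a REINFORCE-style loss with a
# leave-one-out baseline (RLOO). Shapes and values are illustrative.
import torch

def rloo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE loss with a leave-one-out baseline.

    logprobs: (k,) summed log-probabilities of k sampled completions
              for the same prompt under the current policy.
    rewards:  (k,) scalar sequence-level rewards from the reward model.
    """
    k = rewards.shape[0]
    # Leave-one-out baseline: for each sample, the mean reward of the
    # other k-1 samples drawn for the same prompt.
    baseline = (rewards.sum() - rewards) / (k - 1)
    advantages = rewards - baseline
    # Policy-gradient objective: maximize advantage-weighted log-likelihood.
    # Advantages are constants w.r.t. the policy parameters (detached).
    return -(advantages.detach() * logprobs).mean()

# Toy usage: 4 completions sampled for one prompt.
logprobs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([0.2, 0.8, 0.5, 0.1])
loss = rloo_loss(logprobs, rewards)
loss.backward()
```

Unlike PPO, this estimator needs no learned value network and no clipped surrogate objective, which is the source of the computational and tuning savings the article highlights.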

Publication date: 22 Feb 2024
Project Page: https://arxiv.org/pdf/2402.14740.pdf
Paper: https://arxiv.org/pdf/2402.14740