The article presents a study on using Reinforcement Learning from Human Feedback (RLHF) to improve the performance of the GPT Neo 125M model on Community Question Answering (CQA) for programming. The study uses answer scores from Stack Overflow as the feedback signal and employs two distinct reward model training strategies for fine-tuning with Proximal Policy Optimization (PPO). The researchers also introduce an auxiliary scoring mechanism, highlighting the need for domain-specific evaluation methods. The study contributes to ongoing efforts to refine Large Language Models through focused human feedback.
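To make the setup concrete, here is a minimal sketch of such a PPO fine-tuning loop, written against the Hugging Face TRL library (API as in the TRL 0.x releases, e.g. 0.11) with GPT Neo 125M as the policy. This is an illustration under stated assumptions, not the paper's implementation: `reward_fn` is a hypothetical stand-in for a reward model trained on Stack Overflow scores, and the questions are illustrative rather than drawn from the study's dataset.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Policy and frozen reference model (used for the KL penalty in PPO).
config = PPOConfig(
    model_name="EleutherAI/gpt-neo-125m",
    learning_rate=1.41e-5,
    batch_size=2,
    mini_batch_size=1,
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

def reward_fn(question: str, answer: str) -> torch.Tensor:
    # Hypothetical placeholder: the paper instead trains reward models
    # on Stack Overflow answer scores.
    return torch.tensor(1.0 if "def " in answer else -1.0)

questions = [
    "How do I reverse a list in Python?",
    "How do I read a file line by line in Python?",
]
query_tensors = [
    tokenizer.encode(q, return_tensors="pt").squeeze(0) for q in questions
]

# One PPO step: generate answers, score them, then update the policy.
response_tensors = ppo_trainer.generate(
    query_tensors,
    return_prompt=False,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
rewards = [reward_fn(q, a) for q, a in zip(questions, responses)]
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

In a full run, this step would be repeated over batches of CQA questions, with the learned reward model (rather than the toy heuristic above) supplying the scalar rewards.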
Publication date: 22 Jan 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2401.10882