The paper investigates how annotator agreement in preference data affects the efficacy of Reinforcement Learning from Human Feedback (RLHF) for text summarization. The authors show that training on human preferences spanning a diverse range of annotator agreement yields more accurate reward models and changes which quality characteristics those models capture. The findings also indicate improved downstream generation when the reward model is trained on preferences covering a range of agreement levels. This has implications for the design of synthetic preference datasets and underscores the importance of accounting for quality differentials in comparison-based data.
Publication date: 2 Nov 2023
Project Page: https://arxiv.org/abs/2311.04919
Paper: https://arxiv.org/pdf/2311.04919
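
Reward models in this setting are typically trained with a Bradley-Terry style pairwise loss over human comparisons. The sketch below is a minimal, hypothetical illustration of such a loss with an optional per-pair agreement weight; the `agreement` weighting and the function name `pairwise_reward_loss` are illustrative assumptions, not the authors' published method.

```python
# Hypothetical sketch: Bradley-Terry pairwise reward-model loss with an
# optional per-comparison annotator-agreement weight. The weighting scheme
# is an illustrative assumption, not the paper's method.
from typing import Optional

import torch
import torch.nn.functional as F


def pairwise_reward_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor,
                         agreement: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Negative log-likelihood that the chosen summary outranks the rejected one.

    r_chosen, r_rejected: reward-model scores, shape (batch,)
    agreement: optional per-pair agreement weight in [0, 1], shape (batch,)
    """
    # log-sigmoid of the score margin; a larger margin gives a lower loss
    nll = -F.logsigmoid(r_chosen - r_rejected)
    if agreement is not None:
        # One possible use of agreement: down-weight low-agreement comparisons
        nll = nll * agreement
    return nll.mean()


if __name__ == "__main__":
    # Toy usage with random scores; agreement could be, e.g., the fraction
    # of annotators who preferred the chosen summary.
    scores_chosen = torch.randn(8)
    scores_rejected = torch.randn(8)
    agree = torch.rand(8)
    print(pairwise_reward_loss(scores_chosen, scores_rejected, agree).item())
```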