The study focuses on Reinforcement Learning from Human Feedback (RLHF) and how its reward signal can be improved. In standard RLHF, the language model generates a completion in response to a query, and a separate reward model assigns a single scalar score to the full completion, leaving the per-token reward signal sparse. The researchers propose a method that uses the attention weights of the reward model's transformer to redistribute that scalar reward across the tokens of the completion, densifying the signal and highlighting the most important tokens. This approach stabilises training, accelerates the rate of learning, and can lead to better local optima.
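The sketch below illustrates the general idea of attention-based reward redistribution, not the paper's actual implementation: attention weights averaged over layers and heads are used to split one sequence-level reward into per-token rewards. The function name `redistribute_reward`, the use of the final position's attention row, and the reward-model output format (a tuple of per-layer attention tensors, as returned by Hugging Face models with `output_attentions=True`) are assumptions made for illustration.

```python
import torch

def redistribute_reward(scalar_reward, attentions, completion_mask):
    """Spread a single sequence-level reward over completion tokens
    using the reward model's attention weights (illustrative sketch).

    scalar_reward:   float, reward assigned to the full completion
    attentions:      tuple of tensors, one per layer, each of shape
                     (batch=1, heads, seq_len, seq_len)
    completion_mask: bool tensor of shape (seq_len,), True for completion tokens
    """
    # Average attention over layers and heads -> (seq_len, seq_len).
    attn = torch.stack([a[0].mean(dim=0) for a in attentions]).mean(dim=0)

    # Attention paid by the final position to every token (assumed here to be
    # the position whose representation feeds the reward head).
    token_weights = attn[-1]

    # Keep only completion tokens and renormalise so the weights sum to 1.
    token_weights = token_weights * completion_mask.float()
    token_weights = token_weights / token_weights.sum().clamp(min=1e-8)

    # Dense per-token rewards that sum back to the original scalar reward.
    return scalar_reward * token_weights
```

Under this sketch, the resulting per-token rewards could replace the single end-of-sequence reward in the policy-gradient update, giving every generated token an immediate learning signal rather than one delayed reward at the end of the completion.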

 

Publication date: 1 Feb 2024
Project Page: https://arxiv.org/abs/2402.00782v1
Paper: https://arxiv.org/pdf/2402.00782