The paper studies overoptimization in large language models (LLMs) that are optimized to align with human preferences. Because human preferences are multi-faceted, reward is often derived from a composition of simpler reward models, and composing them is a challenge: each component can be over-optimized past the point where it stops being a useful proxy. The authors address this with constrained reinforcement learning, preventing the agent from pushing any component reward model beyond its threshold of usefulness. The constrained formulation also handles the problem of weighting the component reward models, since the learned constraint multipliers act as dynamic weights. Finally, the authors introduce an adaptive method that uses gradient-free optimization to identify these usefulness thresholds and optimize toward them during a single run.
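To make the dynamic-weighting idea concrete, here is a minimal sketch (not the authors' implementation) of how constraint multipliers can act as learned weights over component reward models. All names (`update_multipliers`, `thresholds`, the toy loop) are illustrative assumptions, and the usefulness thresholds are treated as known, whereas the paper identifies them with gradient-free optimization during a single training run.

```python
# Sketch: Lagrangian-style multipliers as dynamic weights over component
# reward models. Illustrative only; names and update rules are assumptions.
import numpy as np

def update_multipliers(lambdas, avg_rewards, thresholds, lr=0.05):
    """Dual (projected gradient) step: a multiplier grows while its component
    reward is below its threshold and shrinks toward zero once the threshold
    is reached, so that component stops being pushed further."""
    lambdas = lambdas + lr * (thresholds - avg_rewards)
    return np.clip(lambdas, 0.0, None)  # multipliers must stay non-negative

def combined_reward(component_rewards, lambdas):
    """Weighted mixture used for the policy update; the weights adapt each
    iteration instead of being fixed by hand."""
    weights = lambdas / (lambdas.sum() + 1e-8)
    return float(weights @ component_rewards)

# Toy loop: two component reward models with assumed thresholds 0.6 and 0.4.
rng = np.random.default_rng(0)
thresholds = np.array([0.6, 0.4])
lambdas = np.ones(2)
for step in range(200):
    # Stand-in for rollout scores from each component reward model.
    avg_rewards = rng.uniform(0.0, 1.0, size=2)
    r = combined_reward(avg_rewards, lambdas)  # reward fed to the policy
    lambdas = update_multipliers(lambdas, avg_rewards, thresholds)

print("final multipliers:", lambdas)
```

The point of the sketch is the interplay between constraints and weighting: once a component's average reward reaches its threshold, its multiplier decays, which is one way to stop optimizing a proxy past the point where it remains useful.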

Publication date: 6 Oct 2023
Project Page: https://arxiv.org/abs/2310.04373
Paper: https://arxiv.org/pdf/2310.04373