Confronting Reward Model Overoptimization with Constrained RLHF
The paper addresses reward model overoptimization in large language models (LLMs) that are optimized to align with human preferences. The authors highlight that human preferences are multi-faceted and…