The paper examines two problems in aligning language models to human preferences with reward models. First, noting that any monotone transformation of a reward model preserves the preference ranking it induces, it asks whether some particular transformation is better suited to alignment than the raw reward. Second, it asks how multiple reward models should be combined when aligning a language model to several properties at once. To answer both, the paper proposes a probabilistic interpretation of the alignment procedure that suggests a natural transformation: it concentrates improvement on poorly-performing outputs and makes aggregation of rewards across properties principled. Experiments show significant improvements over alignment with untransformed rewards.
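As a rough illustration of the idea, here is a minimal sketch of how such a transform-and-aggregate step might look. The specific choice shown (a log-sigmoid transform centered at a per-prompt reference reward, with aggregation by summing the transformed per-property rewards) and all function names are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)


def transform_reward(reward, reference_reward):
    """Monotone transform of a raw reward (assumed log-sigmoid centering).

    Centering at a reference reward (e.g. the reward of a baseline output
    for the same prompt) keeps the preference ranking unchanged, but the
    transform saturates for already-good outputs, so optimization pressure
    concentrates on poorly-performing ones.
    """
    return log_sigmoid(reward - reference_reward)


def aggregate_rewards(rewards, reference_rewards):
    """Combine several per-property rewards by summing their transformed
    values; under a probabilistic reading, this resembles a log-probability
    that the output is good on all properties simultaneously."""
    return sum(
        transform_reward(r, r_ref)
        for r, r_ref in zip(rewards, reference_rewards)
    )


# Illustrative usage with made-up reward values for two properties
# (e.g. helpfulness and harmlessness):
combined = aggregate_rewards(rewards=[2.3, -0.5], reference_rewards=[1.0, 0.0])
print(combined)
```

In this sketch, the summation rule reflects the "all properties must be satisfied" reading of multi-property alignment that the probabilistic interpretation is meant to support.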
Publication date: 1 Feb 2024
Project Page: https://arxiv.org/abs/2402.00742v1
Paper: https://arxiv.org/pdf/2402.00742