The study examines two problems that arise when aligning language models to human preferences with learned reward models. First, since any monotone transformation of a reward model preserves the preference ranking it was trained on, is there a transformation that is better suited for alignment than others? Second, when aligning a language model to multiple properties, how should several reward models be combined? Using a probabilistic interpretation of the alignment procedure, the paper derives a transformation for rewards learned from Bradley-Terry preference models that emphasizes improving poorly-performing outputs and enables principled aggregation of multiple rewards. RLHF experiments aligning models to be both helpful and harmless show substantial improvements over the non-transformed baseline.
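A minimal sketch of the transformation and aggregation described above, assuming the log-sigmoid-of-centered-reward form derived for Bradley-Terry reward models; the reference score `r_ref` (e.g., the reward of a baseline response for the same prompt) and the function names are illustrative, not the paper's code.

```python
import numpy as np

def transform_reward(r, r_ref):
    """Log-sigmoid transform of a centered reward: log sigmoid(r - r_ref).

    The transform saturates once an output already beats the reference
    score, shifting optimization pressure toward poorly-performing outputs.
    """
    # log sigmoid(x) = -log(1 + exp(-x)), computed stably via logaddexp
    return -np.logaddexp(0.0, -(r - r_ref))

def combine_rewards(rewards, references):
    """Sum transformed rewards across properties.

    Under the probabilistic reading, the sum corresponds to the
    log-probability that the output is 'good' on every property
    (a logical AND over properties).
    """
    return sum(transform_reward(r, r_ref) for r, r_ref in zip(rewards, references))

# Example: two properties (e.g., helpfulness and harmlessness)
print(combine_rewards([1.2, -0.3], [0.0, 0.0]))
```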

 

Publication date: 1 Feb 2024
Project Page: https://arxiv.org/abs/2402.00742v1
Paper: https://arxiv.org/pdf/2402.00742