This paper introduces the Distributional Preference Reward Model (DPRM) to align large language models with diverse human preferences. The authors characterize preferences with a Beta distribution, which can adapt to shifting preference trends, and design an optimal-transport-based loss to calibrate the DPRM to the population's preference distribution. The expected reward under this distribution is then used to fine-tune the language model policy to generate responses favored by the population. Experiments show that DPRM improves alignment with population preferences, yielding more accurate and contextually appropriate responses.
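
The sketch below is not the authors' implementation; it is a minimal illustration, under assumed simplifications, of the three ingredients the summary mentions: a Beta distribution discretized over K ordinal preference categories (assumed discretization), a 1-D optimal-transport (Wasserstein-1) loss between a predicted and a target preference distribution (which for ordered categories reduces to a sum of absolute CDF differences), and an expected reward computed as a probability-weighted mean of per-category reward values. The category count, reward values, and helper names are all hypothetical.

```python
import numpy as np
from scipy.stats import beta

K = 5  # number of ordinal preference categories (assumption)

def discretized_beta(a: float, b: float, k: int = K) -> np.ndarray:
    """Bin a Beta(a, b) density into k equal-width categories on [0, 1]."""
    edges = np.linspace(0.0, 1.0, k + 1)
    probs = beta.cdf(edges[1:], a, b) - beta.cdf(edges[:-1], a, b)
    return probs / probs.sum()

def wasserstein1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """1-D optimal-transport (Wasserstein-1) distance between two categorical
    distributions over ordered, unit-spaced categories: sum of |CDF differences|."""
    return float(np.abs(np.cumsum(pred) - np.cumsum(target)).sum())

def expected_reward(pred: np.ndarray, category_rewards: np.ndarray) -> float:
    """Probability-weighted reward, usable as a scalar policy fine-tuning signal."""
    return float(pred @ category_rewards)

# Toy usage: calibrate a predicted distribution against a Beta-shaped target.
target = discretized_beta(a=4.0, b=2.0)          # population preferences skew positive
pred = np.array([0.10, 0.15, 0.25, 0.30, 0.20])  # model's predicted distribution
print("OT calibration loss:", wasserstein1_loss(pred, target))
print("Expected reward:", expected_reward(pred, np.linspace(-1.0, 1.0, K)))
```

In this toy setup the OT loss would be minimized with respect to the model's predicted distribution during reward-model calibration, and the resulting expected reward would serve as the scalar signal for policy optimization; the actual training objective and optimization details are specified in the paper.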

Publication date: 16 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.09764