Aligning Crowd Feedback via Distributional Preference Reward Modeling
This paper introduces the Distributional Preference Reward Model (DPRM) to align large language models with a diverse set of human preferences. The researchers used a beta distribution to characterize preferences,…
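As a rough illustration of the idea mentioned above, the sketch below shows one hypothetical way to characterize crowd preferences between two responses with a Beta distribution; this is an assumption-laden example, not the paper's implementation, and the names (preference_distribution, n_prefer_a, n_prefer_b) are invented for illustration.

```python
# Hypothetical sketch: summarize annotator votes between two responses as a
# Beta posterior over the probability that response A is preferred.
import torch
from torch.distributions import Beta


def preference_distribution(n_prefer_a: int, n_prefer_b: int, prior: float = 1.0) -> Beta:
    """Posterior Beta over P(A preferred), given annotator vote counts
    and a symmetric Beta(prior, prior) prior."""
    return Beta(torch.tensor(prior + n_prefer_a), torch.tensor(prior + n_prefer_b))


# Example: 7 of 10 annotators preferred response A.
dist = preference_distribution(7, 3)
soft_label = dist.mean        # expected preference probability, ~0.67
uncertainty = dist.variance   # shrinks as more annotator votes are collected
print(float(soft_label), float(uncertainty))
```

In a setup like this, the soft label could stand in for a hard 0/1 preference when training a reward model, so that disagreement among annotators is reflected in the training signal rather than discarded.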