Preference learning aims to align the generations of large language models (LLMs) with human preferences. While most prior work focuses on in-distribution preference learning, this paper addresses out-of-distribution (OOD) preference learning, which helps improve the generalization ability of LLMs when preference feedback is limited. The study proposes a general reward model optimized through a meta-learning approach: a bilevel optimization algorithm is used during meta-training to learn a reward model capable of guiding policy learning toward alignment with human preferences across various distributions. The proposed method outperforms strong baselines across different evaluation metrics.
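
The bilevel structure can be made concrete with a short sketch. The snippet below is a hypothetical illustration, not the paper's implementation: an inner loop adapts a policy to maximize the current reward model's reward on one sampled distribution, and an outer loop updates the reward model so that the adapted policy agrees with held-out human preference labels, with gradients flowing through the inner adaptation. The linear policy, synthetic tensors, shapes, and hyperparameters are all placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM, K = 16, 4  # feature dimension, candidate responses per prompt (assumed)

# Reward model (meta-learned at the outer level): scalar reward per response feature.
reward_model = nn.Sequential(nn.Linear(DIM, 32), nn.Tanh(), nn.Linear(32, 1))
outer_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def policy_logits(theta, feats):
    # Placeholder linear policy: score each of the K candidate responses, shape (B, K).
    return feats @ theta

def inner_adapt(theta, feats, inner_lr=0.1, steps=2):
    """Inner level: adapt the policy to maximize the learned reward.
    create_graph=True keeps the dependence on reward_model parameters,
    so the outer preference loss can backpropagate through this adaptation."""
    for _ in range(steps):
        probs = F.softmax(policy_logits(theta, feats), dim=-1)   # (B, K)
        rewards = reward_model(feats).squeeze(-1)                # (B, K)
        inner_loss = -(probs * rewards).sum(dim=-1).mean()       # maximize expected reward
        (grad,) = torch.autograd.grad(inner_loss, theta, create_graph=True)
        theta = theta - inner_lr * grad
    return theta

for step in range(200):
    # One sampled "distribution" (task): synthetic placeholder features and labels.
    train_feats = torch.randn(32, K, DIM)    # responses used for policy adaptation
    val_feats = torch.randn(32, K, DIM)      # held-out responses with preference labels
    preferred = torch.randint(0, K, (32,))   # index of the human-preferred response

    theta0 = torch.zeros(DIM, requires_grad=True)   # fresh policy per task
    theta_adapted = inner_adapt(theta0, train_feats)

    # Outer level: the adapted policy should pick the human-preferred response.
    outer_loss = F.cross_entropy(policy_logits(theta_adapted, val_feats), preferred)
    outer_opt.zero_grad()
    outer_loss.backward()   # gradients reach reward_model through the inner update
    outer_opt.step()
```

In this sketch the reward model is never trained directly on preference pairs; it is updated only so that policies adapted with it end up preference-aligned, which mirrors the meta-learning framing described above.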

Publication date: 23 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.14760