The study focuses on ‘Learning from Preferential Feedback’ (LfPF), a crucial aspect of training large language models and certain interactive learning agents. The authors introduce a new framework, the Direct Preference Process, for analyzing LfPF problems in partially observable, non-Markovian environments. Using the ordinal structure of the preferences, they establish conditions that guarantee the existence of optimal policies. The Direct Preference Process generalizes the standard reinforcement learning problem and helps bridge the gap between the empirical success and the theoretical understanding of LfPF algorithms.
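
To illustrate the kind of generalization involved, here is a minimal sketch in generic notation (the symbols below are illustrative and not the paper's exact definitions). In standard reinforcement learning, an optimal policy maximizes expected return:
\[
\pi^* \in \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} r(s_t, a_t)\Big].
\]
In a preference-based formulation there need be no reward function at all; instead, a (possibly partial) preference relation \(\succeq\) is placed over the outcomes induced by policies, and a policy \(\pi^*\) is optimal when
\[
\pi^* \succeq \pi \quad \text{for every policy } \pi .
\]
Standard RL is recovered as the special case where \(\succeq\) ranks policies by expected return, but in general \(\succeq\) carries only ordinal information, which is why the existence of optimal policies must be argued directly from the order structure rather than from a numerical objective.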

 

Publication date: 3 Nov 2023
Project Page: https://arxiv.org/abs/2311.01990v1
Paper: https://arxiv.org/pdf/2311.01990