The study addresses off-policy evaluation (OPE) in reinforcement learning when only human preference data, rather than observed rewards, is available. The authors analyze the sample efficiency of OPE in this setting and establish a statistical guarantee for it. Their approach estimates the value function via fitted Q-evaluation (FQE) with a deep neural network. The paper shows that by appropriately choosing the size of a ReLU network, the estimator can exploit the low-dimensional manifold structure of the Markov decision process, yielding a sample-efficient estimate. The resulting guarantees align with classical OPE results obtained under observable rewards, providing an efficiency guarantee for off-policy evaluation with RLHF.
Publication date: 17 Oct 2023
Project Page: ?
Paper: https://arxiv.org/pdf/2310.10556
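
As a rough illustration of the fitted Q-evaluation procedure mentioned in the summary, the sketch below shows FQE with a small ReLU network in PyTorch. It assumes rewards have already been inferred from preference data (e.g., by a separately learned reward model); the function names, network sizes, and the use of dataset states as a stand-in for the initial-state distribution are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal FQE sketch (assumptions noted above; not the paper's implementation).
import torch
import torch.nn as nn

def fitted_q_evaluation(transitions, target_policy, state_dim, action_dim,
                        gamma=0.99, n_iters=50, n_grad_steps=200,
                        lr=1e-3, hidden=64):
    """Estimate the value of `target_policy` from offline transitions.

    transitions: list of (state, action_onehot, reward, next_state) tensors,
        where `reward` is assumed to come from a preference-based reward model
        rather than being directly observed.
    target_policy: maps a batch of states to a batch of one-hot actions.
    """
    # ReLU network; in the paper's theory its size would be chosen to match
    # the intrinsic (low-dimensional manifold) structure of the MDP.
    q_net = nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

    s = torch.stack([t[0] for t in transitions])
    a = torch.stack([t[1] for t in transitions])
    r = torch.stack([t[2] for t in transitions]).unsqueeze(-1)
    s_next = torch.stack([t[3] for t in transitions])

    for _ in range(n_iters):
        # Regression target r + gamma * Q_k(s', pi(s')), frozen per iteration.
        with torch.no_grad():
            a_next = target_policy(s_next)
            target = r + gamma * q_net(torch.cat([s_next, a_next], dim=-1))

        opt = torch.optim.Adam(q_net.parameters(), lr=lr)
        for _ in range(n_grad_steps):
            pred = q_net(torch.cat([s, a], dim=-1))
            loss = nn.functional.mse_loss(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Policy value estimate: average Q over (here) the dataset states and the
    # target policy's actions, used as a simple proxy for initial states.
    with torch.no_grad():
        a0 = target_policy(s)
        return q_net(torch.cat([s, a0], dim=-1)).mean().item()
```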