The study focuses on off-policy evaluation (OPE) in reinforcement learning with human preference data, where feedback comes as preferences rather than directly observed rewards. The authors study the sample efficiency of OPE in this setting and establish a statistical guarantee for it. Their approach learns the value function via fitted-Q evaluation (FQE) with a deep neural network. The paper shows that by appropriately choosing the size of the ReLU network, the estimator can exploit the low-dimensional manifold structure of the Markov decision process and achieve sample efficiency. The resulting rates align with classical OPE results that assume observable reward data, yielding a sample-efficiency guarantee for off-policy evaluation under RLHF.
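To make the method concrete, below is a minimal sketch of fitted-Q evaluation with a small ReLU network. This is not the authors' code: the network sizes, dataset layout, and `target_policy` interface are illustrative assumptions, and the sketch takes reward signals as given (in the paper's setting these would come from preference data rather than be observed directly).

```python
import torch
import torch.nn as nn

# Illustrative ReLU Q-network; the width/depth are the knobs that the paper's
# analysis ties to the low-dimensional structure of the data (placeholder sizes here).
class QNet(nn.Module):
    def __init__(self, state_dim, num_actions, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, num_actions),
        )

    def forward(self, s):
        return self.net(s)

def fitted_q_evaluation(data, target_policy, state_dim, num_actions,
                        gamma=0.99, iterations=50, epochs=20, lr=1e-3):
    """Fitted-Q evaluation sketch: repeatedly regress Q(s, a) onto the one-step
    Bellman target r + gamma * Q_prev(s', pi_e(s')) using a logged dataset.

    `data` is assumed to be a dict of tensors: states, actions, rewards,
    next_states, dones. `target_policy(states)` returns the evaluation
    policy's action indices (hypothetical interface)."""
    q = QNet(state_dim, num_actions)
    for _ in range(iterations):
        # Freeze a copy of the current Q-function to build regression targets.
        q_prev = QNet(state_dim, num_actions)
        q_prev.load_state_dict(q.state_dict())
        with torch.no_grad():
            next_a = target_policy(data["next_states"])
            next_q = q_prev(data["next_states"]).gather(1, next_a.unsqueeze(1)).squeeze(1)
            target = data["rewards"] + gamma * (1 - data["dones"]) * next_q
        # Fit Q to the fixed targets by least squares.
        opt = torch.optim.Adam(q.parameters(), lr=lr)
        for _ in range(epochs):
            pred = q(data["states"]).gather(1, data["actions"].unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # The target policy's value is then estimated as E_{s0}[Q(s0, pi_e(s0))]
    # over initial states.
    return q
```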


Publication date: 17 Oct 2023
Project Page: ?
Paper: https://arxiv.org/pdf/2310.10556