This article summarizes work on off-policy evaluation (OPE) in environments with complex observations, where the goal is to design estimators whose error does not grow exponentially with the horizon. The authors study the future-dependent value function framework, which was proposed to sidestep estimation errors that scale with the state-density ratio. They identify limitations of this approach and propose new coverage assumptions tailored to the structure of Partially Observable Markov Decision Processes (POMDPs). Under these assumptions, they derive new algorithms with complementary properties, together with polynomial bounds on quantities related to the future-dependent value function.
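To see why naive OPE suffers exponential dependence on the horizon, consider trajectory importance sampling, the standard baseline the paper's framework aims to improve on. The sketch below is illustrative only (it is not the authors' method): with a deterministic target policy and a uniform behavior policy over two actions, the per-step importance ratio is 2 when the sampled action matches and 0 otherwise, so the trajectory weight is either 0 or 2^H, and its second moment (hence the estimator's variance) blows up exponentially in the horizon H.

```python
import random

def is_weight(horizon, rng):
    """Trajectory importance weight for a deterministic target policy
    (always action 0) under a uniform behavior policy over 2 actions.
    Per-step ratio: pi_e(a)/pi_b(a) = 2 if a == 0, else 0."""
    w = 1.0
    for _ in range(horizon):
        a = rng.randrange(2)          # behavior action, uniform over {0, 1}
        w *= 2.0 if a == 0 else 0.0   # target policy puts all mass on 0
    return w

def second_moment(horizon, n, seed=0):
    """Monte Carlo estimate of E[w^2]; analytically E[w^2] = 2^horizon,
    which is the exponential 'curse of horizon' for importance sampling."""
    rng = random.Random(seed)
    ws = [is_weight(horizon, rng) for _ in range(n)]
    return sum(w * w for w in ws) / n
```

Although the weight is unbiased (E[w] = 1 for every horizon), its second moment 2^H makes long-horizon estimates useless, which is the failure mode that future-dependent value functions are designed to avoid.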
Publication date: 22 Feb 2024
Project Page: https://arxiv.org/abs/2402.14703v1
Paper: https://arxiv.org/pdf/2402.14703