The paper proposes Pessimistic Nonlinear Least-Square Value Iteration (PNLSVI), an algorithm for offline reinforcement learning with nonlinear function approximation. The algorithm has three innovative components: a variance-based weighted regression scheme, a subroutine for variance estimation, and a planning phase that uses pessimistic value iteration. The authors claim that the algorithm's regret bound depends tightly on the complexity of the function class and that it achieves minimax optimal instance-dependent regret when specialized to linear function approximation. This extends previous results for simpler function classes, such as linear and differentiable functions, to a more general framework.
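
To make two of these components concrete, here is a minimal sketch of variance-weighted regression followed by a pessimistic value estimate. It is not the authors' algorithm: it assumes the linear special case with a feature matrix `phi`, takes the variance estimates `sigma2` as given rather than computing them with a subroutine, and uses a standard elliptical uncertainty penalty. The function names `weighted_ridge` and `pessimistic_value` and the parameters `lam` and `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np


def weighted_ridge(phi, targets, sigma2, lam=1.0):
    """Variance-weighted ridge regression (linear special case, illustrative).

    phi:     (n, d) feature matrix
    targets: (n,) regression targets
    sigma2:  (n,) variance estimates, assumed given and strictly positive
    Returns the weight vector w and the weighted covariance matrix Lambda.
    """
    weights = 1.0 / sigma2  # low-variance samples count more
    weighted_phi = phi * weights[:, None]
    Lambda = lam * np.eye(phi.shape[1]) + weighted_phi.T @ phi
    w = np.linalg.solve(Lambda, weighted_phi.T @ targets)
    return w, Lambda


def pessimistic_value(phi_query, w, Lambda, beta=1.0):
    """Point estimate minus an uncertainty penalty (the pessimism step)."""
    # Quadratic form phi^T Lambda^{-1} phi for each query row.
    solved = np.linalg.solve(Lambda, phi_query.T).T
    penalty = beta * np.sqrt(np.einsum("id,id->i", phi_query, solved))
    return phi_query @ w - penalty


if __name__ == "__main__":
    phi = np.eye(2)
    targets = np.array([1.0, 2.0])
    sigma2 = np.array([0.25, 1.0])  # first sample is assumed less noisy
    w, Lambda = weighted_ridge(phi, targets, sigma2)
    print(pessimistic_value(phi, w, Lambda))
```

The penalty shrinks for feature directions that the weighted data covers well, so the estimate is lowered most where the offline dataset is least informative.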

 

Publication date: 2 Oct 2023
Project Page: https://arxiv.org/abs/2310.01380
Paper: https://arxiv.org/pdf/2310.01380