This article summarizes a Bayesian approach to off-policy evaluation (OPE) and off-policy learning (OPL) in large action spaces. The authors propose sDM, a unified Bayesian framework that leverages correlations between actions without compromising computational efficiency, and they introduce Bayesian metrics that assess average performance across multiple problem instances. Experiments on both OPE and OPL demonstrate the benefit of exploiting these action correlations. The running example is online advertising, where the context is a user's features, the action is the product to recommend, and the reward is a click (so the expected reward is the click-through rate). The authors highlight that existing methods often degrade as the action space grows, whereas the proposed method remains strong.
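To make the setup concrete, below is a minimal sketch of off-policy evaluation with a direct method on logged bandit data, where a shared Gaussian prior pools information across actions. This is only an illustration of the general idea of sharing strength across correlated actions; the variable names (`lambda_prior`, `dm_value`, the data shapes) are hypothetical and this is not the paper's actual sDM implementation or its structured prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged bandit data: contexts (user features), actions chosen
# by the logging policy (products shown), and observed binary rewards (clicks).
n, d, K = 5000, 8, 100            # log size, context dimension, number of actions
X = rng.normal(size=(n, d))       # contexts
A = rng.integers(0, K, size=n)    # logged actions
R = rng.binomial(1, 0.1, size=n)  # logged rewards (clicks)

# Per-action ridge regression; the shared Gaussian prior (lambda_prior) is a
# crude stand-in for the structured prior that couples correlated actions.
lambda_prior = 1.0
theta = np.zeros((K, d))
for a in range(K):
    idx = A == a
    Xa, Ra = X[idx], R[idx]
    # MAP estimate under an isotropic Gaussian prior; falls back toward the
    # prior mean when an action has little or no logged data.
    theta[a] = np.linalg.solve(Xa.T @ Xa + lambda_prior * np.eye(d), Xa.T @ Ra)

def dm_value(pi, X, theta):
    """Direct-method estimate of a target policy's value.

    pi(x) returns a length-K vector of action probabilities for context x.
    """
    q = X @ theta.T                        # predicted reward for every (x, a)
    probs = np.apply_along_axis(pi, 1, X)  # target policy probabilities
    return float(np.mean(np.sum(probs * q, axis=1)))

# Example target policy: uniform over all actions.
uniform_pi = lambda x: np.ones(K) / K
print("estimated value:", dm_value(uniform_pi, X, theta))
```

In this sketch the prior simply regularizes each action independently; the point of sDM is that a structured prior over action parameters lets sparse feedback on one action inform estimates for related actions.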


Publication date: 23 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.14664