This paper introduces a new perspective on stochastic optimal control in dynamical systems, a key challenge in sequential decision-making. The authors cast optimal control as Markovian score climbing, with samples drawn from a conditional particle filter. This yields gradient estimates for policy optimization without requiring explicit value function learning. The method is applied to learning neural, non-Gaussian feedback policies and is demonstrated on numerical benchmarks of stochastic dynamical systems. The work contributes to the ongoing exploration of control-as-inference approaches and to balancing exploration-exploitation dynamics in decision-making.
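The summary above can be made concrete with a toy sketch of the general idea: a conditional particle filter acts as a Markov kernel over trajectories (one "reference" particle is pinned and survives resampling), and the policy parameter is updated by ascending the score of the policy's log-density along a sampled trajectory. Everything below is a hedged illustration, not the paper's algorithm or benchmarks: the 1-D dynamics, quadratic reward, Gaussian policy `u ~ N(theta * x, sigma^2)`, and all constants are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem (not from the paper): 1-D dynamics
# x_{t+1} = x_t + u_t + noise, with an "optimality" likelihood
# proportional to exp(reward), as in control-as-inference.
T, N, iters = 15, 128, 200          # horizon, particles, optimization steps
sigma_u, sigma_x, lr = 0.5, 0.1, 0.05

def reward(x, u):
    return -(x ** 2 + 0.1 * u ** 2)

def csmc_sweep(theta, ref):
    """One conditional-particle-filter sweep; particle 0 is pinned to `ref`."""
    x = np.full(N, 1.0)                                 # fixed initial state
    traj_x = np.zeros((T, N)); traj_u = np.zeros((T, N))
    for t in range(T):
        u = theta * x + sigma_u * rng.standard_normal(N)
        if ref is not None:
            x[0], u[0] = ref[0][t], ref[1][t]           # retained reference path
        logw = reward(x, u)
        w = np.exp(logw - logw.max())                   # stabilized weights
        w /= w.sum()
        a = rng.choice(N, size=N, p=w)                  # multinomial resampling
        if ref is not None:
            a[0] = 0                                    # reference always survives
        x, u = x[a], u[a]
        traj_x[:t] = traj_x[:t][:, a]                   # resample trajectory history
        traj_u[:t] = traj_u[:t][:, a]
        traj_x[t], traj_u[t] = x, u
        x = x + u + sigma_x * rng.standard_normal(N)    # propagate dynamics
    k = int(rng.integers(N))                            # draw one surviving trajectory
    return traj_x[:, k], traj_u[:, k]

theta, ref = 0.0, None
for _ in range(iters):
    ref = csmc_sweep(theta, ref)        # Markov kernel over trajectories
    xs, us = ref
    # Score of the Gaussian policy's log-density along the sampled path.
    score = np.sum((us - theta * xs) * xs) / sigma_u ** 2
    theta += lr * score / T             # score-climbing update

print(theta)
```

Because the reward penalizes deviations of the state from zero, the sampled high-reward trajectories favor stabilizing actions, pulling `theta` toward negative feedback gains; the exact value here is meaningless outside this toy setup.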
Publication date: 21 Dec 2023
arXiv abstract: https://arxiv.org/abs/2312.14000v1
PDF: https://arxiv.org/pdf/2312.14000