The study addresses relative overgeneralization (RO) in cooperative multi-agent learning tasks. RO occurs when agents converge to a suboptimal joint policy because they overfit to the suboptimal behavior of other agents. The authors propose a general framework that enables optimistic updates in multi-agent policy gradient (MAPG) methods to mitigate this issue. Specifically, they reshape the advantages used in the policy update with a Leaky ReLU function, whose single hyperparameter (the slope applied to negative advantages) sets the degree of optimism. The proposed method outperforms strong baselines on 13 of the 19 tested tasks and matches their performance on the rest.
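To make the advantage reshaping concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the Leaky ReLU keeps non-negative advantages unchanged and multiplies negative ones by a slope `eta` in [0, 1]. The name `eta`, the helper names, and the PPO-style clipped surrogate (the objective MAPPO builds on) are illustrative assumptions. A smaller `eta` means more optimism: an action whose low return may stem from teammates' poor behavior is penalized less.

```python
import numpy as np


def leaky_relu_advantages(advantages, eta):
    """Hypothetical helper: optimistic advantage reshaping via Leaky ReLU.

    Non-negative advantages pass through unchanged; negative ones are scaled
    by eta. eta = 1 recovers the original advantages, eta = 0 ignores
    negative advantages entirely (maximal optimism).
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    advantages = np.asarray(advantages, dtype=float)
    return np.where(advantages >= 0.0, advantages, eta * advantages)


def clipped_surrogate_loss(ratio, advantages, eta, clip_eps=0.2):
    """Hypothetical PPO-style clipped loss that uses the reshaped advantages.

    ratio is pi_new(a|s) / pi_old(a|s) per sample; the objective is negated
    so it can be minimized by a gradient-descent optimizer.
    """
    adv = leaky_relu_advantages(advantages, eta)
    ratio = np.asarray(ratio, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))


if __name__ == "__main__":
    # Negative advantages are shrunk by eta = 0.25: [2.0, -0.25, 0.0, -1.0]
    print(leaky_relu_advantages([2.0, -1.0, 0.0, -4.0], eta=0.25))
    # With ratio 1 nothing is clipped: -(1.0 + (-0.5)) / 2 = -0.25
    print(clipped_surrogate_loss([1.0, 1.0], [1.0, -1.0], eta=0.5))
```

Because only the advantages are transformed, this kind of reshaping can be dropped into an existing policy-gradient update without touching the critic or the rest of the loss, which matches the summary's description of a general framework controlled by one hyperparameter.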
Publication date: 3 Nov 2023
Project Page: https://github.com/wenshuaizhao/optimappo
Paper: https://arxiv.org/pdf/2311.01953