Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game
Abstract
Deep policy gradient methods have demonstrated promising results in many large-scale games, where the agent learns purely from its own experience. Yet, policy gradient methods with self-play suffer from convergence problems to a Nash Equilibrium (NE) in multi-agent settings. Counterfactual regret minimization (CFR) has a convergence guarantee to a NE in 2-player zero-sum games, but it usually requires domain-specific abstractions to deal with large-scale games. Inheriting merits from both methods, in this paper we extend the actor-critic algorithm framework in deep reinforcement learning to tackle a large-scale 2-player zero-sum imperfect-information game, 1-on-1 Mahjong, whose information set size and game length are much larger than those of poker. The proposed algorithm, named Actor-Critic Hedge (ACH), modifies the policy optimization objective from maximizing the discounted return to minimizing a type of weighted cumulative counterfactual regret. This modification is achieved by approximating the regret via a deep neural network and minimizing the regret by generating self-play policies with Hedge. ACH is theoretically justified, as it is derived from a neural-based weighted CFR for which we prove convergence to a NE under certain conditions. Experimental results on the proposed 1-on-1 Mahjong benchmark and benchmarks from the literature demonstrate that ACH outperforms related state-of-the-art methods. Moreover, the agent obtained by ACH defeats a human champion in 1-on-1 Mahjong.
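To make the mechanism described above more concrete, below is a minimal sketch of how an actor-critic network could be read as a regret estimator and turned into a Hedge policy. It is an illustration under stated assumptions, not the paper's implementation: the class and function names (RegretActor, hedge_policy, ach_style_loss), the network sizes, the learning-rate parameter eta, and the way sampled regrets and value targets are supplied are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegretActor(nn.Module):
    """Hypothetical actor-critic: the actor head outputs per-action
    cumulative counterfactual regret estimates; the critic head outputs
    a value estimate used to construct regret targets during self-play."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.regret_head = nn.Linear(hidden, n_actions)  # approximate regrets R(s, .)
        self.value_head = nn.Linear(hidden, 1)           # critic value estimate

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return self.regret_head(h), self.value_head(h).squeeze(-1)


def hedge_policy(regrets: torch.Tensor, eta: float = 1.0) -> torch.Tensor:
    """Hedge: the self-play policy is a softmax over (clipped) regret
    estimates scaled by a learning rate eta."""
    return F.softmax(eta * regrets.clamp(min=0.0), dim=-1)


def ach_style_loss(actor, obs, actions, regret_target, value_target=None):
    """Illustrative objective (an assumption, not the paper's exact loss):
    regress the regret head toward sampled counterfactual-regret targets
    estimated along self-play trajectories, and fit the critic."""
    regrets, values = actor(obs)
    taken = regrets.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    regret_loss = F.mse_loss(taken, regret_target)
    value_loss = F.mse_loss(values, value_target) if value_target is not None else 0.0
    return regret_loss + 0.5 * value_loss
```

The design point this sketch tries to convey is the one in the abstract: instead of treating the actor's outputs as action preferences to be pushed toward higher discounted return, they are treated as regret estimates, and the behavior policy is generated from them with Hedge, so that driving the estimated cumulative regret down drives self-play toward an equilibrium.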
Cite
Text
Fu et al. "Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game." International Conference on Learning Representations, 2022.
Markdown
[Fu et al. "Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game." International Conference on Learning Representations, 2022.](https://mlanthology.org/iclr/2022/fu2022iclr-actorcritic/)
BibTeX
@inproceedings{fu2022iclr-actorcritic,
title = {{Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game}},
author = {Fu, Haobo and Liu, Weiming and Wu, Shuang and Wang, Yijia and Yang, Tao and Li, Kai and Xing, Junliang and Li, Bin and Ma, Bo and Fu, Qiang and Wei, Yang},
booktitle = {International Conference on Learning Representations},
year = {2022},
url = {https://mlanthology.org/iclr/2022/fu2022iclr-actorcritic/}
}