CASA: Bridging the Gap Between Policy Improvement and Policy Evaluation with Conflict Averse Policy Iteration

Abstract

We study the problem of model-free reinforcement learning, which is often solved following the principle of Generalized Policy Iteration (GPI). While GPI is typically an interplay between policy evaluation and policy improvement, most conventional model-free methods with function approximation treat the GPI steps as independent, despite the inherent connections between them. In this paper, we present a method that attempts to eliminate the inconsistency between the policy evaluation step and the policy improvement step, leading to a conflict-averse GPI solution with gradient-based function approximation. Our method is central to balancing exploitation and exploration between policy-based and value-based methods, and is applicable to existing policy-based and value-based methods. We conduct extensive experiments to study the theoretical properties of our method and demonstrate its effectiveness on the Atari 200M benchmark.
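To make the idea of a conflict-averse update concrete, below is a minimal, hypothetical sketch of how a gradient conflict between the policy-evaluation loss and the policy-improvement loss might be resolved for a shared set of parameters. The function name `conflict_averse_update` and the PCGrad-style projection are illustrative assumptions on our part, not the update rule defined in the CASA paper.

```python
import numpy as np

def conflict_averse_update(g_eval, g_improve, lr=1e-3):
    """Combine policy-evaluation and policy-improvement gradients.

    If the two gradients conflict (negative inner product), each is
    projected onto the normal plane of the other before averaging, so
    neither step undoes the other's progress. This is a PCGrad-style
    illustration, not CASA's actual algorithm.
    """
    g_e, g_i = g_eval.copy(), g_improve.copy()
    if np.dot(g_eval, g_improve) < 0:  # directions conflict
        g_e -= np.dot(g_eval, g_improve) / (np.dot(g_improve, g_improve) + 1e-12) * g_improve
        g_i -= np.dot(g_improve, g_eval) / (np.dot(g_eval, g_eval) + 1e-12) * g_eval
    return -lr * (g_e + g_i) / 2.0  # update for the shared parameters

# Toy usage: two conflicting gradients over two shared parameters
update = conflict_averse_update(np.array([1.0, -0.5]), np.array([-0.8, 1.2]))
```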

Cite

Text

Xiao et al. "CASA: Bridging the Gap Between Policy Improvement and Policy Evaluation with Conflict Averse Policy Iteration." NeurIPS 2022 Workshops: DeepRL, 2022.

Markdown

[Xiao et al. "CASA: Bridging the Gap Between Policy Improvement and Policy Evaluation with Conflict Averse Policy Iteration." NeurIPS 2022 Workshops: DeepRL, 2022.](https://mlanthology.org/neuripsw/2022/xiao2022neuripsw-casa/)

BibTeX

@inproceedings{xiao2022neuripsw-casa,
  title     = {{CASA: Bridging the Gap Between Policy Improvement and Policy Evaluation with Conflict Averse Policy Iteration}},
  author    = {Xiao, Changnan and Shi, Haosen and Fan, Jiajun and Deng, Shihong and Yin, Haiyan},
  booktitle = {NeurIPS 2022 Workshops: DeepRL},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/xiao2022neuripsw-casa/}
}