Divergence-Augmented Policy Optimization

Abstract

In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman divergence between the behavior policy that generates the data and the current policy, ensuring small and safe policy updates with off-policy data. The Bregman divergence is calculated between the state distributions of the two policies, rather than only on the action probabilities, leading to a divergence-augmented formulation. Empirical experiments on Atari games show that, in the data-scarce scenario where the reuse of off-policy data becomes necessary, our method achieves better performance than other state-of-the-art deep reinforcement learning algorithms.
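To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a divergence-augmented surrogate loss in NumPy. It instantiates the Bregman divergence as a per-state KL penalty between the behavior policy and the current policy; the paper's formulation instead measures the divergence between the state distributions induced by the two policies, so this is only an illustrative simplification. All names (divergence_augmented_loss, beta, etc.) are hypothetical.

import numpy as np

def divergence_augmented_loss(pi_probs, mu_probs, actions, advantages, beta=1.0):
    """Illustrative surrogate: importance-weighted policy gradient term
    minus a KL(mu || pi) penalty scaled by beta (KL is the Bregman
    divergence generated by negative entropy).

    pi_probs   : (N, A) current-policy action probabilities at sampled states
    mu_probs   : (N, A) behavior-policy action probabilities at the same states
    actions    : (N,)   actions taken by the behavior policy
    advantages : (N,)   advantage estimates for the taken actions
    """
    idx = np.arange(len(actions))
    rho = pi_probs[idx, actions] / mu_probs[idx, actions]    # importance ratios
    pg_term = np.mean(rho * advantages)                      # off-policy PG surrogate
    kl_term = np.mean(np.sum(mu_probs * np.log(mu_probs / pi_probs), axis=1))
    return -(pg_term - beta * kl_term)                       # minimize negative objective

In this simplified view, beta controls how strongly the update is kept close to the behavior policy; the paper's state-distribution divergence plays the same stabilizing role when off-policy data are reused.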

Cite

Text

Wang et al. "Divergence-Augmented Policy Optimization." Neural Information Processing Systems, 2019.

Markdown

[Wang et al. "Divergence-Augmented Policy Optimization." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/wang2019neurips-divergenceaugmented/)

BibTeX

@inproceedings{wang2019neurips-divergenceaugmented,
  title     = {{Divergence-Augmented Policy Optimization}},
  author    = {Wang, Qing and Li, Yingru and Xiong, Jiechao and Zhang, Tong},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {6099--6110},
  url       = {https://mlanthology.org/neurips/2019/wang2019neurips-divergenceaugmented/}
}