General Munchausen Reinforcement Learning with Tsallis Kullback-Leibler Divergence

Abstract

Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence---called the Tsallis KL divergence. The Tsallis KL divergence, defined via the $q$-logarithm, is a strict generalization: $q = 1$ corresponds to the standard KL divergence, while $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
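As a point of reference (a minimal sketch; the paper's exact normalization of the $q$-logarithm may differ), one common convention defines, for $q \neq 1$,

$$\ln_q x = \frac{x^{q-1} - 1}{q - 1}, \qquad \lim_{q \to 1} \ln_q x = \ln x,$$

and the induced Tsallis KL divergence between a policy $\pi$ and a reference policy $\mu$ as

$$D^{q}_{\mathrm{KL}}(\pi \,\|\, \mu) = \mathbb{E}_{a \sim \pi}\!\left[\ln_q \frac{\pi(a)}{\mu(a)}\right],$$

which reduces to the standard KL divergence as $q \to 1$.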

Cite

Text

Zhu et al. "General Munchausen Reinforcement Learning with Tsallis Kullback-Leibler Divergence." Neural Information Processing Systems, 2023.

Markdown

[Zhu et al. "General Munchausen Reinforcement Learning with Tsallis Kullback-Leibler Divergence." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/zhu2023neurips-general/)

BibTeX

@inproceedings{zhu2023neurips-general,
  title     = {{General Munchausen Reinforcement Learning with Tsallis Kullback-Leibler Divergence}},
  author    = {Zhu, Lingwei and Chen, Zheng and Schlegel, Matthew and White, Martha},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/zhu2023neurips-general/}
}