ConQUR: Mitigating Delusional Bias in Deep Q-Learning

Abstract

Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.
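The penalization scheme described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a linear approximator `Q(s, a) = theta[a] @ phi_s`, a hypothetical `conqur_loss` helper, and a simple hinge-style penalty; the actual ConQUR method is defined in the paper itself.

```python
import numpy as np

def conqur_loss(theta, batch, gamma=0.99, lam=0.5):
    """Hedged sketch of Q-learning with a consistency penalty (a hypothetical
    simplification of ConQUR's penalization scheme).

    Each transition is (phi_s, a, r, phi_s2, a_lbl), where a_lbl is the greedy
    action recorded when the bootstrapped label r + gamma * Q(s', a_lbl) was
    generated. The penalty is positive whenever the current theta's greedy
    action at s' disagrees with a_lbl, nudging the parameters toward a region
    where the greedy policy stays consistent with the actions used to build
    the training labels.
    """
    td_loss, penalty = 0.0, 0.0
    for phi_s, a, r, phi_s2, a_lbl in batch:
        q_next = theta @ phi_s2                    # current Q(s', .) values
        label = r + gamma * q_next[a_lbl]          # label tied to action a_lbl
        td_loss += (theta[a] @ phi_s - label) ** 2
        # Hinge-style consistency term: zero iff a_lbl is still greedy at s'.
        penalty += np.sum(np.maximum(q_next - q_next[a_lbl], 0.0))
    return td_loss + lam * penalty

# Tiny 2-action example: theta's greedy action at s' is action 1, so a label
# built with a_lbl=1 incurs no penalty, while a_lbl=0 does.
theta = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
phi_s, phi_s2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
consistent = conqur_loss(theta, [(phi_s, 0, 1.0, phi_s2, 1)])
inconsistent = conqur_loss(theta, [(phi_s, 0, 1.0, phi_s2, 0)])
```

In the consistent case the loss is pure TD error; in the inconsistent case the penalty term is strictly positive, which is the mechanism the abstract refers to as keeping Q-labels jointly consistent with the expressible greedy policy class.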

Cite

Text

Su et al. "ConQUR: Mitigating Delusional Bias in Deep Q-Learning." International Conference on Machine Learning, 2020.

Markdown

[Su et al. "ConQUR: Mitigating Delusional Bias in Deep Q-Learning." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/su2020icml-conqur/)

BibTeX

@inproceedings{su2020icml-conqur,
  title     = {{ConQUR: Mitigating Delusional Bias in Deep Q-Learning}},
  author    = {Su, Dijia and Ooi, Jayden and Lu, Tyler and Schuurmans, Dale and Boutilier, Craig},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {9187--9195},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/su2020icml-conqur/}
}