Density Estimation for Conservative Q-Learning

Abstract

Batch Reinforcement Learning algorithms aim to learn the best possible policy from a batch of data without interacting with the environment. In this setting, one difficulty is correctly assessing the value of state-action pairs far from the data set. Indeed, the lack of information may cause the value function to be overestimated, leading to undesirable behaviours. A compromise must be found between improving on the behaviour policy and staying close to it. To alleviate this issue, most existing approaches introduce a regularization term to favour state-action pairs from the data set. In this paper, we refine this idea by estimating the density of these state-action pairs to distinguish between neighbourhoods. The resulting regularization guides the policy toward meaningful unseen regions, improving the learning process. We therefore introduce Density Conservative Q-Learning (D-CQL), a sound batch RL algorithm that carefully penalizes the value function based on the information collected in the state-action space. The performance of our approach is demonstrated on a range of classical batch RL benchmarks.
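To make the idea concrete, below is a minimal, hypothetical sketch of a density-weighted conservative penalty. The kernel density estimator, the normalisation in `density`, and the `1 - density` weighting are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical batch of concatenated (state, action) vectors.
rng = np.random.default_rng(0)
batch_sa = rng.normal(size=(500, 3))

# Step 1: fit a density model on the batch. KDE is one possible
# estimator; the abstract does not prescribe a particular choice.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(batch_sa)

def density(sa: np.ndarray) -> np.ndarray:
    """Estimated density of state-action pairs, rescaled to [0, 1]."""
    p = np.exp(kde.score_samples(sa))
    return p / p.max()

def dcql_penalty(q_ood: np.ndarray, q_data: np.ndarray,
                 sa_ood: np.ndarray, alpha: float = 1.0) -> float:
    """CQL-style penalty that pushes down Q-values of out-of-distribution
    pairs more strongly when their estimated density under the batch is
    low (the weighting scheme here is an assumption)."""
    weight = 1.0 - density(sa_ood)  # low density -> strong penalty
    return alpha * float(np.mean(weight * q_ood) - np.mean(q_data))

# Toy usage: random Q-values for OOD pairs (sampled far from the batch)
# and for in-batch pairs.
sa_ood = rng.normal(loc=3.0, size=(100, 3))
q_ood = rng.normal(size=100)
q_data = rng.normal(size=500)
print("penalty:", dcql_penalty(q_ood, q_data, sa_ood))
```

In this sketch, pairs well covered by the batch receive almost no penalty, while pairs in sparse regions are penalised almost as in plain CQL, which mirrors the abstract's goal of distinguishing neighbourhoods by their data coverage.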

Cite

Text

Daoudi et al. "Density Estimation for Conservative Q-Learning." ICLR 2022 Workshops: GPL, 2022.

Markdown

[Daoudi et al. "Density Estimation for Conservative Q-Learning." ICLR 2022 Workshops: GPL, 2022.](https://mlanthology.org/iclrw/2022/daoudi2022iclrw-density/)

BibTeX

@inproceedings{daoudi2022iclrw-density,
  title     = {{Density Estimation for Conservative Q-Learning}},
  author    = {Daoudi, Paul and Dos Santos, Ludovic and Barlier, Merwan and Virmaux, Aladin},
  booktitle = {ICLR 2022 Workshops: GPL},
  year      = {2022},
  url       = {https://mlanthology.org/iclrw/2022/daoudi2022iclrw-density/}
}