Entropy-Regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning

Abstract

Diffusion policies have shown a strong ability to express complex action distributions in offline reinforcement learning (RL). However, they suffer from overestimating the Q-value function on out-of-distribution (OOD) data points due to the limited coverage of the offline dataset. To address this, this paper proposes a novel entropy-regularized diffusion policy that takes into account the confidence of the Q-value prediction with Q-ensembles. At the core of our diffusion policy is a mean-reverting stochastic differential equation (SDE) that transfers the action distribution into a standard Gaussian form and then samples actions conditioned on the environment state with a corresponding reverse-time process. We show that the entropy of such a policy is tractable and that it can be used to increase the exploration of OOD samples during offline RL training. Moreover, we propose using the lower confidence bound of Q-ensembles for a pessimistic Q-value function estimate. The proposed approach demonstrates state-of-the-art performance across a range of tasks in the D4RL benchmarks, significantly improving upon existing diffusion-based policies. The code is available at https://github.com/ruoqizzz/entropy-offlineRL.
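
To make the two core ideas concrete, here is a minimal math sketch of a mean-reverting forward SDE. It assumes the common Ornstein-Uhlenbeck-type parameterization (with mean μ = 0 so that actions relax toward a standard Gaussian); the exact noise schedule θ_t, σ_t used in the paper may differ.

```latex
% Mean-reverting (Ornstein--Uhlenbeck-type) forward SDE.
% With \mu = 0 and \sigma_t^2 = 2\theta_t, the stationary distribution
% of x_t is a standard Gaussian; the reverse-time process then samples
% actions conditioned on the environment state.
\[
  \mathrm{d}x_t = \theta_t\,(\mu - x_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}w_t
\]
```

And below is a minimal sketch of the pessimistic Q-ensemble estimate, assuming the standard lower-confidence-bound form (ensemble mean minus β times ensemble standard deviation). `QNet`, `lcb_q_value`, `beta`, and all dimensions are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Minimal Q(s, a) critic; layer sizes are placeholders."""

    def __init__(self, state_dim=17, action_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def lcb_q_value(q_ensemble, state, action, beta=1.0):
    """Lower confidence bound over the ensemble: mean - beta * std.

    Larger beta means a more pessimistic value estimate on actions
    where the ensemble members disagree (typically OOD actions).
    """
    qs = torch.stack([q(state, action) for q in q_ensemble])  # (K, batch)
    return qs.mean(dim=0) - beta * qs.std(dim=0)


# Usage: K independently initialized critics form the ensemble.
ensemble = [QNet() for _ in range(5)]
s, a = torch.randn(32, 17), torch.randn(32, 6)
q_pessimistic = lcb_q_value(ensemble, s, a, beta=2.0)
```

Because independently trained critics tend to agree on in-distribution state-action pairs and diverge on OOD ones, the subtracted standard deviation acts as a data-driven pessimism penalty that needs no explicit density model of the behavior policy.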

Cite

Text

Zhang et al. "Entropy-Regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning." Neural Information Processing Systems, 2024. doi:10.52202/079017-3138

Markdown

[Zhang et al. "Entropy-Regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhang2024neurips-entropyregularized/) doi:10.52202/079017-3138

BibTeX

@inproceedings{zhang2024neurips-entropyregularized,
  title     = {{Entropy-Regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning}},
  author    = {Zhang, Ruoqi and Luo, Ziwei and Sjölund, Jens and Schön, Thomas B. and Mattsson, Per},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3138},
  url       = {https://mlanthology.org/neurips/2024/zhang2024neurips-entropyregularized/}
}