Offline Reinforcement Learning with Mixture of Deterministic Policies
Abstract
Offline reinforcement learning (RL) has recently attracted considerable attention as an approach for utilizing past experiences to learn a policy. Recent studies have reported the challenges of offline RL, such as estimating the values of actions that are outside the data distribution. To mitigate these issues, we propose an algorithm that leverages a mixture of deterministic policies. When the data distribution is multimodal, fitting a policy modeled with a unimodal distribution, such as a Gaussian distribution, may lead to interpolation between separate modes, thereby resulting in the value estimation of actions that are outside the data distribution. In our framework, the state-action space is divided by learning discrete latent variables, and the sub-policies corresponding to each region are trained. The proposed algorithm is derived by considering the variational lower bound of the offline RL objective function. We show empirically that the use of the proposed mixture policy reduces the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that using a mixture of deterministic policies in offline RL improves performance on the D4RL benchmark datasets.
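To illustrate the idea described in the abstract, below is a minimal sketch of a mixture of deterministic policies: a gating network over a discrete latent variable selects one of several deterministic sub-policy heads for each state, so distinct modes of the data are handled by separate sub-policies rather than averaged by a single unimodal policy. This is not the authors' implementation; all module names, network sizes, and the number of components are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' code) of a mixture of
# deterministic policies with a discrete latent gating network.
import torch
import torch.nn as nn


class MixtureOfDeterministicPolicies(nn.Module):
    def __init__(self, state_dim: int, action_dim: int,
                 num_components: int = 4, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Gating network: logits over the discrete latent variable (K components).
        self.gate = nn.Linear(hidden_dim, num_components)
        # One deterministic sub-policy head per mixture component.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, action_dim) for _ in range(num_components)]
        )

    def forward(self, state: torch.Tensor):
        h = self.encoder(state)
        logits = self.gate(h)  # (batch, K) component scores
        # Actions proposed by every sub-policy: (batch, K, action_dim).
        actions = torch.stack([torch.tanh(head(h)) for head in self.heads], dim=1)
        return logits, actions

    @torch.no_grad()
    def act(self, state: torch.Tensor) -> torch.Tensor:
        # At evaluation time, select the sub-policy favored by the gate,
        # avoiding interpolation between separate modes of the data.
        logits, actions = self.forward(state)
        idx = logits.argmax(dim=-1)  # (batch,)
        return actions[torch.arange(state.shape[0]), idx]


# Example usage (dimensions chosen arbitrarily):
# policy = MixtureOfDeterministicPolicies(state_dim=17, action_dim=6)
# action = policy.act(torch.randn(1, 17))
```

In this sketch, each sub-policy output stays close to one region of the state-action space; how the gate and sub-policies are trained against the variational lower bound of the offline RL objective is specific to the paper and not reproduced here.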
Cite

Text
Osa et al. "Offline Reinforcement Learning with Mixture of Deterministic Policies." Transactions on Machine Learning Research, 2023.

Markdown
[Osa et al. "Offline Reinforcement Learning with Mixture of Deterministic Policies." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/osa2023tmlr-offline/)

BibTeX
@article{osa2023tmlr-offline,
title = {{Offline Reinforcement Learning with Mixture of Deterministic Policies}},
author = {Osa, Takayuki and Hayashi, Akinobu and Deo, Pranav and Morihira, Naoki and Yoshiike, Takahide},
journal = {Transactions on Machine Learning Research},
year = {2023},
url = {https://mlanthology.org/tmlr/2023/osa2023tmlr-offline/}
}