Batch Reinforcement Learning with Hyperparameter Gradients

Abstract

We consider the batch reinforcement learning problem, where the agent must learn only from a fixed batch of data without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy, since estimation otherwise becomes highly unstable due to the off-policy nature of the problem. However, imposing this requirement too strongly results in a policy that merely follows the data collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), which performs gradient-based optimization of the hyperparameter using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks by finding a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.
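
Below is a minimal, hypothetical sketch of the idea described in the abstract: a tabular policy that is KL-regularized toward the data collection (behavior) policy, with the regularization strength `alpha` tuned by gradient ascent on an importance-sampling value estimate computed from held-out trajectories. All names (`behavior_policy`, `q_values`, `heldout_trajs`) and the finite-difference hyperparameter gradient are illustrative assumptions, not the paper's actual formulation or interface.

```python
# Hypothetical sketch of trading off "stay close to the behavior policy" vs.
# "pursue policy improvement", with the trade-off hyperparameter tuned on
# held-out data (a stand-in for the gradient-based tuning named in the abstract).
import numpy as np

def kl_regularized_policy(q_values, behavior_policy, alpha):
    """Closed-form maximizer of E_pi[Q] - alpha * KL(pi || behavior_policy):
    pi(a|s) proportional to behavior_policy(a|s) * exp(Q(s,a) / alpha)."""
    logits = np.log(behavior_policy + 1e-12) + q_values / alpha
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    policy = np.exp(logits)
    return policy / policy.sum(axis=1, keepdims=True)

def heldout_value_estimate(policy, behavior_policy, heldout_trajs, gamma=0.99):
    """Per-decision importance-sampling estimate of the policy's value on
    held-out trajectories collected by the behavior policy."""
    returns = []
    for traj in heldout_trajs:                             # traj: [(s, a, r), ...]
        rho, value = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= policy[s, a] / behavior_policy[s, a]
            value += (gamma ** t) * rho * r
        returns.append(value)
    return float(np.mean(returns))

def tune_alpha(q_values, behavior_policy, heldout_trajs,
               alpha=1.0, lr=0.1, steps=50, eps=1e-3):
    """Gradient ascent on the held-out value estimate with respect to alpha,
    using a finite-difference approximation of the hyperparameter gradient."""
    def objective(a):
        pi = kl_regularized_policy(q_values, behavior_policy, a)
        return heldout_value_estimate(pi, behavior_policy, heldout_trajs)
    for _ in range(steps):
        grad = (objective(alpha + eps) - objective(alpha - eps)) / (2 * eps)
        alpha = max(1e-3, alpha + lr * grad)               # keep alpha positive
    return alpha
```

The closed-form softmax policy above is the standard solution of a KL-regularized objective and is used here only to make the trade-off concrete; the paper's own estimator and hyperparameter gradient may differ.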

Cite

Text

Lee et al. "Batch Reinforcement Learning with Hyperparameter Gradients." International Conference on Machine Learning, 2020.

Markdown

[Lee et al. "Batch Reinforcement Learning with Hyperparameter Gradients." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/lee2020icml-batch/)

BibTeX

@inproceedings{lee2020icml-batch,
  title     = {{Batch Reinforcement Learning with Hyperparameter Gradients}},
  author    = {Lee, Byungjun and Lee, Jongmin and Vrancx, Peter and Kim, Dongho and Kim, Kee-Eung},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {5725--5735},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/lee2020icml-batch/}
}