Way Off-Policy Batch Deep Reinforcement Learning of Human Preferences in Dialog

Abstract

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. This is a critical shortcoming for applying RL to real-world problems where collecting data is expensive and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms which use KL-control to penalize divergence from a pre-trained prior model of probable actions. This KL-constraint reduces extrapolation error, enabling effective offline learning, without exploration, from a fixed batch of data. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. This Way Off-Policy (WOP) algorithm is tested both on traditional RL tasks from OpenAI Gym and on open-domain dialog generation, a challenging reinforcement learning problem with a 20,000-dimensional action space. WOP allows multiple different reward functions to be extracted post-hoc from collected human interaction data, and can learn effectively from all of them. We test real-world generalization by deploying dialog models live to converse with humans in an open-domain setting, and demonstrate that WOP achieves significant improvements over state-of-the-art prior methods in batch deep RL.
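The two ingredients the abstract names -- a KL-control penalty toward a pre-trained prior, and a dropout-based lower bound on target Q-values -- can be combined in a single Q-learning target. The sketch below is illustrative only; `wop_target` and its argument names are hypothetical, and the exact form of the update should be taken from the paper itself.

```python
def wop_target(reward, next_q_samples, log_pi, log_prior,
               alpha=0.1, gamma=0.99):
    """Illustrative sketch of a WOP-style Q-learning target (assumed form).

    next_q_samples: Q-value estimates for the next state computed under
        several dropout masks; taking their minimum yields a pessimistic
        lower bound (the paper's alternative to Double Q-Learning).
    log_pi, log_prior: log-probabilities of the taken action under the
        learned policy and the pre-trained prior; their difference is the
        per-action KL-control term penalizing divergence from the prior.
    alpha, gamma: KL penalty weight and discount factor (hypothetical values).
    """
    q_lower = min(next_q_samples)            # dropout-based lower bound
    kl_term = alpha * (log_pi - log_prior)   # KL penalty toward the prior
    return reward - kl_term + gamma * q_lower
```

Taking the minimum over dropout samples avoids the second target network of Double Q-Learning while still counteracting overestimation, which is what makes it the "more efficient alternative" the abstract refers to.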

Cite

Text

Jaques et al. "Way Off-Policy Batch Deep Reinforcement Learning of Human Preferences in Dialog." International Conference on Learning Representations, 2020.

Markdown

[Jaques et al. "Way Off-Policy Batch Deep Reinforcement Learning of Human Preferences in Dialog." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/jaques2020iclr-way/)

BibTeX

@inproceedings{jaques2020iclr-way,
  title     = {{Way Off-Policy Batch Deep Reinforcement Learning of Human Preferences in Dialog}},
  author    = {Jaques, Natasha and Ghandeharioun, Asma and Shen, Judy Hanwen and Ferguson, Craig and Lapedriza, Agata and Jones, Noah and Gu, Shixiang and Picard, Rosalind},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/jaques2020iclr-way/}
}