Disagreement-Regularized Imitation Learning

Abstract

We present a simple and effective algorithm that addresses the covariate shift problem in imitation learning. It trains an ensemble of policies on the expert demonstration data and uses the variance of their predictions as a cost, which is minimized with RL together with a supervised behavioral cloning cost. Unlike adversarial imitation methods, it uses a fixed reward function which is easy to optimize. We prove a regret bound for the algorithm which is linear in the time horizon multiplied by a coefficient which we show to be low for certain problems in which behavioral cloning fails. We evaluate our algorithm empirically across multiple pixel-based Atari environments and continuous control tasks, and show that it matches or significantly outperforms behavioral cloning and generative adversarial imitation learning.
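The core recipe described in the abstract, training an ensemble of policies on the expert data and using the variance of their predictions as a cost, can be sketched in a few lines. This is a minimal toy illustration only, assuming discrete actions, a linear softmax ensemble trained on bootstrap resamples, and synthetic "expert" data; all names and data below are illustrative, not the paper's implementation, and the RL step that would minimize this cost alongside the behavioral cloning loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "expert" demonstrations (hypothetical stand-in for real data):
# states in R^4, discrete actions in {0, 1, 2}.
n, d, n_actions, K = 200, 4, 3, 5
states = rng.normal(size=(n, d))
actions = (states[:, 0] > 0).astype(int)  # simple synthetic expert rule


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fit_policy(X, y, lr=0.5, steps=300):
    """Fit one linear softmax policy by gradient descent on the
    behavioral cloning (cross-entropy) loss."""
    W = rng.normal(scale=0.1, size=(X.shape[1], n_actions))
    for _ in range(steps):
        p = softmax(X @ W)
        p[np.arange(len(y)), y] -= 1.0  # gradient of cross-entropy w.r.t. logits
        W -= lr * X.T @ p / len(y)
    return W


# Train an ensemble on bootstrap resamples of the expert data.
ensemble = []
for _ in range(K):
    idx = rng.integers(0, n, size=n)
    ensemble.append(fit_policy(states[idx], actions[idx]))


def disagreement_cost(s):
    """Variance of the ensemble's action distributions at state s,
    summed over actions. This is the fixed cost the learner would
    minimize with RL, together with the supervised BC cost."""
    probs = np.stack([softmax(s @ W) for W in ensemble])  # (K, n_actions)
    return probs.var(axis=0).sum()


# Query the cost on an expert state and on an arbitrary state.
print(disagreement_cost(states[0]))
print(disagreement_cost(rng.normal(size=d)))
```

Because the cost is a fixed function of the trained ensemble, it does not change as the learner improves, which is the contrast the abstract draws with adversarial imitation methods whose reward is a moving target.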

Cite

Text

Brantley et al. "Disagreement-Regularized Imitation Learning." International Conference on Learning Representations, 2020.

Markdown

[Brantley et al. "Disagreement-Regularized Imitation Learning." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/brantley2020iclr-disagreementregularized/)

BibTeX

@inproceedings{brantley2020iclr-disagreementregularized,
  title     = {{Disagreement-Regularized Imitation Learning}},
  author    = {Brantley, Kiante and Sun, Wen and Henaff, Mikael},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/brantley2020iclr-disagreementregularized/}
}