Constrained Imitation Q-Learning with Earth Mover’s Distance Reward

Abstract

We propose constrained Earth Mover's Distance (CEMD) Imitation Q-learning, which combines the exploration of Reinforcement Learning (RL) with the sample efficiency of Imitation Learning (IL). This sample efficiency makes CEMD well suited for robot learning. Immediate rewards are computed efficiently by a greedy Earth Mover's Distance (EMD) variant between observed state-action pairs and the state-action pairs in the stored expert demonstrations. In CEMD, we constrain the previously proposed non-stationary greedy EMD reward with a greedy EMD upper-bound estimate and a generic Q-learning lower bound. On PyBullet continuous control benchmarks, CEMD is more sample efficient, achieves higher performance, and exhibits lower variance than its competitors.

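To make the greedy EMD reward idea concrete, here is a minimal illustrative sketch, not the paper's exact algorithm: it assumes a Euclidean ground cost over concatenated state-action vectors and an exponential reward shaping, and the function name `greedy_emd_reward`, the mass bookkeeping, and the `alpha`/`beta` parameters are all hypothetical choices. It shows only the greedy matching against stored expert demonstrations (whose consumed mass makes the reward non-stationary), not CEMD's upper- and lower-bound constraints.

```python
import numpy as np

def greedy_emd_reward(obs_sa, expert_sa, expert_mass, step_mass,
                      alpha=5.0, beta=5.0):
    """Illustrative greedy EMD-style immediate reward (assumed formulation).

    obs_sa:      current state-action vector, shape (d,)
    expert_sa:   stored expert state-action pairs, shape (N, d)
    expert_mass: remaining transport mass per expert pair, shape (N,);
                 mutated in place, which makes the reward non-stationary
    step_mass:   mass assigned to each environment step (e.g. 1 / T)
    """
    # Euclidean ground cost between the current pair and every expert pair.
    dists = np.linalg.norm(expert_sa - obs_sa, axis=1)
    cost, remaining = 0.0, step_mass
    # Greedily transport this step's mass to the nearest expert atoms
    # that still have mass available.
    for idx in np.argsort(dists):
        if remaining <= 0.0:
            break
        if expert_mass[idx] <= 0.0:
            continue
        moved = min(remaining, expert_mass[idx])
        cost += moved * dists[idx]
        expert_mass[idx] -= moved
        remaining -= moved
    # Map the transport cost (lower is better) to a bounded positive reward.
    return alpha * np.exp(-beta * cost)

# Usage sketch: uniform expert mass over N demonstration pairs, 1/T mass per step.
N, T, d = 200, 100, 8
expert_sa = np.random.randn(N, d)
expert_mass = np.full(N, 1.0 / N)
reward = greedy_emd_reward(np.random.randn(d), expert_sa, expert_mass, step_mass=1.0 / T)
```
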
Cite

Text

Yang et al. "Constrained Imitation Q-Learning with Earth Mover’s Distance Reward." NeurIPS 2022 Workshops: DeepRL, 2022.

Markdown

[Yang et al. "Constrained Imitation Q-Learning with Earth Mover’s Distance Reward." NeurIPS 2022 Workshops: DeepRL, 2022.](https://mlanthology.org/neuripsw/2022/yang2022neuripsw-constrained/)

BibTeX

@inproceedings{yang2022neuripsw-constrained,
  title     = {{Constrained Imitation Q-Learning with Earth Mover’s Distance Reward}},
  author    = {Yang, Wenyan and Strokina, Nataliya and Pajarinen, Joni and Kamarainen, Joni-Kristian},
  booktitle = {NeurIPS 2022 Workshops: DeepRL},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/yang2022neuripsw-constrained/}
}