SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations

Abstract

We consider offline safe imitation learning (IL), where the agent aims to learn a safe policy that mimics preferred behavior while avoiding non-preferred behavior, given non-preferred demonstrations and unlabeled demonstrations. This problem setting corresponds to various real-world scenarios in which satisfying safety constraints is more important than maximizing the expected return. However, learning a policy that avoids constraint-violating (i.e., non-preferred) behavior is very challenging, in contrast to standard imitation learning, which learns a policy that mimics the given demonstrations. In this paper, we present a hyperparameter-free offline safe IL algorithm, SafeDICE, that learns a safe policy by leveraging the non-preferred demonstrations in the space of stationary distributions. Our algorithm directly estimates the stationary distribution corrections of the policy that imitates the demonstrations while excluding the non-preferred behavior. In the experiments, we demonstrate that our algorithm learns a safer policy that satisfies the cost constraint without degrading reward performance, compared to baseline algorithms.
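
The following is a minimal illustrative sketch (not the paper's exact objective) of the general idea behind leveraging non-preferred demonstrations in distribution space: a classifier estimates the density ratio between non-preferred and unlabeled data, and transitions that look non-preferred are down-weighted when imitating the unlabeled data. All names (train_ratio_classifier, beta, the synthetic features) are hypothetical, and beta is treated here as a fixed mixing assumption even though SafeDICE itself is presented as hyperparameter-free.

# Illustrative sketch only: density-ratio-based reweighting of unlabeled
# demonstrations to exclude non-preferred behavior. Not the SafeDICE objective.
import numpy as np

rng = np.random.default_rng(0)

def train_ratio_classifier(x_unlabeled, x_nonpref, lr=0.1, steps=2000):
    """Logistic regression separating non-preferred (label 1) from unlabeled (label 0).
    Its logit approximates log d_N(x) / d_U(x) up to an additive constant."""
    x = np.vstack([x_unlabeled, x_nonpref])
    y = np.concatenate([np.zeros(len(x_unlabeled)), np.ones(len(x_nonpref))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = x @ w + b
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = p - y                      # gradient of the logistic loss
        w -= lr * (x.T @ grad) / len(x)
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-ins for (state, action) features.
x_unlabeled = rng.normal(0.0, 1.0, size=(1000, 4))  # mixture of preferred and non-preferred
x_nonpref   = rng.normal(1.5, 1.0, size=(300, 4))   # non-preferred demonstrations

w, b = train_ratio_classifier(x_unlabeled, x_nonpref)
ratio = np.exp(x_unlabeled @ w + b)                  # ~ d_N / d_U on unlabeled samples

beta = 0.3                                           # assumed non-preferred fraction (hypothetical)
weights = np.clip(1.0 - beta * ratio, 0.0, None)     # down-weight likely non-preferred transitions
weights /= weights.sum()
# `weights` could then re-weight an imitation (e.g., behavior cloning) loss over the unlabeled data.
print(weights[:5])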

Cite

Text

Jang et al. "SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations." Neural Information Processing Systems, 2023.

Markdown

[Jang et al. "SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/jang2023neurips-safedice/)

BibTeX

@inproceedings{jang2023neurips-safedice,
  title     = {{SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations}},
  author    = {Jang, Youngsoo and Kim, Geon-Hyeong and Lee, Jongmin and Sohn, Sungryull and Kim, Byoungjip and Lee, Honglak and Lee, Moontae},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/jang2023neurips-safedice/}
}