Distance Minimization for Reward Learning from Scored Trajectories
Abstract
Many planning methods rely on the use of an immediate reward function as a portable and succinct representation of desired behavior. Rewards are often inferred from demonstrated behavior that is assumed to be near-optimal. We examine a framework, Distance Minimization IRL (DM-IRL), for learning reward functions from scores an expert assigns to possibly suboptimal demonstrations. By changing the expert’s role from a demonstrator to a judge, DM-IRL relaxes some of the assumptions present in IRL, enabling learning from scores of arbitrary demonstration trajectories even when the transition function is unknown. DM-IRL complements existing IRL approaches by addressing different assumptions about the expert. We show that DM-IRL is robust to expert scoring error and prove that finding a policy that produces maximally informative trajectories for an expert to score is strongly NP-hard. Experimentally, we demonstrate that the reward function DM-IRL learns from an MDP with an unknown transition model can transfer to an agent with known characteristics in a novel environment, and that successful learning is possible with limited training data.
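The sketch below illustrates the core regression idea suggested by the abstract: if the reward is linear in state features, a trajectory's discounted return equals the reward weights dotted with its discounted feature counts, so weights can be fit by minimizing the distance between predicted returns and the expert's scores. This is a minimal, hedged reading of the framing above, not the paper's implementation; the function names, ridge regularizer, discount factor, and toy data are illustrative assumptions.

```python
import numpy as np

def discounted_feature_counts(trajectory_features, gamma=0.95):
    """Compute sum_t gamma^t * phi(s_t) from per-step feature vectors (T x d)."""
    T, _ = trajectory_features.shape
    discounts = gamma ** np.arange(T)
    return discounts @ trajectory_features  # shape (d,)

def fit_reward_weights(trajectories, scores, gamma=0.95, ridge=1e-3):
    """Least-squares fit of linear reward weights w so that w . mu(tau_i)
    approximates the expert's score for trajectory i (mu = discounted feature counts)."""
    Phi = np.stack([discounted_feature_counts(tf, gamma) for tf in trajectories])
    s = np.asarray(scores, dtype=float)
    d = Phi.shape[1]
    # Regularized normal equations: (Phi^T Phi + ridge * I) w = Phi^T s
    return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(d), Phi.T @ s)

# Usage: two toy trajectories over 3-dimensional state features, scored by a "judge".
tau1 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tau2 = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
w = fit_reward_weights([tau1, tau2], scores=[1.0, -0.5])
print(w)  # learned weights; the per-state reward estimate is w @ phi(s)
```

Note that nothing in this regression requires the transition model or optimal demonstrations, which is consistent with the abstract's claim that scoring arbitrary trajectories suffices.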
Cite
Text
Burchfiel et al. "Distance Minimization for Reward Learning from Scored Trajectories." AAAI Conference on Artificial Intelligence, 2016. doi:10.1609/AAAI.V30I1.10411
Markdown
[Burchfiel et al. "Distance Minimization for Reward Learning from Scored Trajectories." AAAI Conference on Artificial Intelligence, 2016.](https://mlanthology.org/aaai/2016/burchfiel2016aaai-distance/) doi:10.1609/AAAI.V30I1.10411
BibTeX
@inproceedings{burchfiel2016aaai-distance,
title = {{Distance Minimization for Reward Learning from Scored Trajectories}},
author = {Burchfiel, Benjamin and Tomasi, Carlo and Parr, Ronald},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2016},
  pages = {3330--3336},
doi = {10.1609/AAAI.V30I1.10411},
url = {https://mlanthology.org/aaai/2016/burchfiel2016aaai-distance/}
}