Preference-Based Policy Learning

Abstract

Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates simulator-free direct policy learning, called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy relative to previously demonstrated ones according to her preferences; these preferences are used to learn a policy return estimate; the robot uses the policy return estimate to build new candidate policies, and the process is iterated until the desired behavior is obtained. PPL requires that a good representation of the policy search space be available, enabling one to learn accurate policy return estimates and limiting the human ranking effort needed to yield a good policy. Furthermore, this representation cannot use informed features (e.g., how far the robot is from any target) due to the simulator-free setting. As a second contribution, this paper proposes a representation based on the agnostic exploitation of the robotic log. The convergence of PPL is analytically studied, and its experimental validation on two problems, involving a single robot in a maze and two interacting robots, is presented.
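The four-step loop described above can be sketched in code. This is a minimal illustrative rendering, not the paper's actual method: it assumes a hidden linear utility standing in for the human expert (in PPL a real expert ranks the demonstrations and no such ground-truth function is available to the learner), a linear policy return estimate fitted with perceptron-style updates on violated preference pairs, and Gaussian perturbation of the last demonstration to generate candidates.

```python
import random

random.seed(0)
DIM = 3

# Hidden utility simulating the human expert's preferences -- an assumption
# of this sketch only; the learner never sees it directly, only rankings.
TRUE_W = [0.8, -0.5, 0.3]

def utility(policy, w):
    """Linear policy return <w, policy>."""
    return sum(wi * pi for wi, pi in zip(w, policy))

def learn_return_estimate(prefs, dim, lr=0.05, epochs=100):
    """Fit a linear return estimate from (better, worse) preference pairs,
    updating only when a ranking is violated (perceptron-style)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in prefs:
            if utility(better, w) <= utility(worse, w):
                for i in range(dim):
                    w[i] += lr * (better[i] - worse[i])
    return w

def ppl(iterations=40, pool=30, sigma=0.3):
    """One possible rendering of the four-step PPL loop."""
    demo = [random.uniform(-1, 1) for _ in range(DIM)]  # initial demonstration
    demos, prefs, w_hat = [demo], [], [0.0] * DIM
    for _ in range(iterations):
        # Step 4 / 1: build candidate policies around the last demonstration
        # and demonstrate the one the current return estimate prefers.
        cands = [[p + random.gauss(0.0, sigma) for p in demo]
                 for _ in range(pool)]
        new = max(cands, key=lambda c: utility(c, w_hat))
        # Step 2: the (simulated) expert ranks the new demo against the last.
        if utility(new, TRUE_W) > utility(demo, TRUE_W):
            prefs.append((new, demo))
        else:
            prefs.append((demo, new))
        # Step 3: re-learn the policy return estimate from all preferences.
        w_hat = learn_return_estimate(prefs, DIM)
        demo = new
        demos.append(demo)
    return demos, w_hat

demos, w_hat = ppl()
```

Because every preference pair is consistent with the hidden utility, the learned estimate comes to agree with it in direction, and successive demonstrations drift toward higher-utility policies.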

Cite

Text

Akrour et al. "Preference-Based Policy Learning." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2011. doi:10.1007/978-3-642-23780-5_11

Markdown

[Akrour et al. "Preference-Based Policy Learning." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2011.](https://mlanthology.org/ecmlpkdd/2011/akrour2011ecmlpkdd-preferencebased/) doi:10.1007/978-3-642-23780-5_11

BibTeX

@inproceedings{akrour2011ecmlpkdd-preferencebased,
  title     = {{Preference-Based Policy Learning}},
  author    = {Akrour, Riad and Schoenauer, Marc and Sebag, Michèle},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2011},
  pages     = {12--27},
  doi       = {10.1007/978-3-642-23780-5_11},
  url       = {https://mlanthology.org/ecmlpkdd/2011/akrour2011ecmlpkdd-preferencebased/}
}