Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement
Abstract
We present a ping-pong-playing robot that learns to improve its swings with human advice. Our method learns a reward function over the joint space of task and policy parameters T×P, so the robot can explore policy space more intelligently, trading off exploration against exploitation to maximize total cumulative reward over time. Multimodal stochastic policies can also be learned easily with this approach when the reward function is multimodal in the policy parameters. We extend the recently developed Gaussian Process Bandit Optimization framework to include exploration-bias advice from human domain experts, using a novel algorithm called Exploration Bias with Directional Advice (EBDA).
Cite
Text

Glover and Zhu. "Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement." AAAI Conference on Artificial Intelligence, 2014. doi:10.1609/AAAI.V28I1.9059

Markdown

[Glover and Zhu. "Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement." AAAI Conference on Artificial Intelligence, 2014.](https://mlanthology.org/aaai/2014/glover2014aaai-generalizing/) doi:10.1609/AAAI.V28I1.9059

BibTeX
@inproceedings{glover2014aaai-generalizing,
title = {{Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement}},
author = {Glover, Jared and Zhu, Charlotte},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2014},
pages = {2534-2541},
doi = {10.1609/AAAI.V28I1.9059},
url = {https://mlanthology.org/aaai/2014/glover2014aaai-generalizing/}
}