Variational Inference for Policy Search in Changing Situations
Abstract
Many policy search algorithms minimize the Kullback-Leibler (KL) divergence to a certain target distribution in order to fit their policy. The commonly used direction of the KL divergence forces the resulting policy to be 'reward-attracted': the policy tries to reproduce all positively rewarded experience while negative experience is neglected. However, the KL divergence is not symmetric, and we can also minimize the reversed KL divergence, which is typically used in variational inference. The policy then becomes 'cost-averse': it tries to avoid reproducing any negatively rewarded experience while maximizing exploration. Due to this 'cost-averseness' of the policy, Variational Inference for Policy Search (VIP) has several interesting properties. It requires neither a kernel bandwidth nor an exploration rate; such settings are determined automatically by the inference. The algorithm matches the performance of state-of-the-art methods while also being applicable to learning simultaneously in multiple situations. We concentrate on using VIP for policy search in robotics and apply our algorithm to learn dynamic counterbalancing of different kinds of pushes with a human-like 4-link robot.
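The distinction between the two KL directions can be illustrated numerically. The sketch below (an assumption for illustration, not code from the paper) fits a unimodal Gaussian to a hypothetical bimodal target: the forward KL, KL(p||q), prefers a broad 'covering' fit that reproduces all of the target's mass, while the reverse KL, KL(q||p), used in variational inference prefers a 'mode-seeking' fit that avoids placing mass where the target has little.

```python
import numpy as np

# Grid for numerical integration
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical bimodal target (stand-in for a reward-weighted target distribution)
p = 0.5 * gauss(x, -3.0, 0.5) + 0.5 * gauss(x, 3.0, 0.5)

q_seek = gauss(x, 3.0, 0.5)    # 'mode-seeking': commits to a single mode
q_cover = gauss(x, 0.0, 3.0)   # 'mode-covering': spreads mass over both modes

def kl(a, b):
    """KL(a || b) by Riemann summation, ignoring regions of negligible mass."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Forward KL(p || q): penalizes q for missing target mass -> prefers q_cover
# Reverse KL(q || p): penalizes q mass where p is low      -> prefers q_seek
print("forward KL(p||q):", kl(p, q_seek), kl(p, q_cover))
print("reverse KL(q||p):", kl(q_seek, p), kl(q_cover, p))
```

Since the mode-seeking fit matches one mixture component exactly, its reverse KL comes out near log 2 (the cost of ignoring half the target's mass), whereas its forward KL is enormous; the two objectives rank the candidates in opposite order. This zero-forcing behavior of the reverse KL is what makes the resulting policy 'cost-averse' in the sense described above.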
Cite
Text
Neumann. "Variational Inference for Policy Search in Changing Situations." International Conference on Machine Learning, 2011.
Markdown
[Neumann. "Variational Inference for Policy Search in Changing Situations." International Conference on Machine Learning, 2011.](https://mlanthology.org/icml/2011/neumann2011icml-variational/)
BibTeX
@inproceedings{neumann2011icml-variational,
title = {{Variational Inference for Policy Search in Changing Situations}},
author = {Neumann, Gerhard},
booktitle = {International Conference on Machine Learning},
year = {2011},
pages = {817--824},
url = {https://mlanthology.org/icml/2011/neumann2011icml-variational/}
}