Policy-Gradients for PSRs and POMDPs
Abstract
In uncertain and partially observable environments, control policies must be a function of the complete history of actions and observations. Rather than present an ever-growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy search.
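To make the combination concrete, the sketch below shows one way the two ingredients described in the abstract can fit together: a Bayesian belief-state update over a known POMDP model, and a softmax policy that takes the belief (a sufficient statistic of the history) as input, trained with a plain REINFORCE-style policy-gradient update. This is a minimal illustration, not the authors' implementation; the `env.step(a) -> (obs, reward, done)` interface, the tensors `T` and `O`, and the shape of `theta` are assumptions made here for the example.

```python
import numpy as np

# Assumed POMDP specification (for illustration only):
#   T[a, s, s']  transition probability of s -> s' under action a
#   O[a, s', o]  probability of observing o after taking a and landing in s'
#   theta        policy parameters, shape (num_states, num_actions)

def belief_update(b, a, o, T, O):
    """Bayes update: b'(s') proportional to O[a, s', o] * sum_s T[a, s, s'] * b(s)."""
    b_next = O[a, :, o] * (T[a].T @ b)
    return b_next / b_next.sum()

def policy(b, theta):
    """Softmax policy acting directly on the belief state."""
    logits = b @ theta                      # linear in the belief
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_episode(env, theta, T, O, b0, gamma=0.99, lr=0.01):
    """One episode of REINFORCE with the belief state as the policy input.

    env.step(a) -> (obs, reward, done) is an assumed interface.
    """
    b, grads, rewards = b0.copy(), [], []
    done = False
    while not done:
        p = policy(b, theta)
        a = np.random.choice(len(p), p=p)
        obs, r, done = env.step(a)
        one_hot = np.zeros_like(p)
        one_hot[a] = 1.0
        # grad of log pi(a | b) for a softmax linear in b: outer(b, one_hot - p)
        grads.append(np.outer(b, one_hot - p))
        rewards.append(r)
        b = belief_update(b, a, obs, T, O)
    # Accumulate discounted returns and take the policy-gradient step.
    G = 0.0
    for g, r in zip(reversed(grads), reversed(rewards)):
        G = r + gamma * G
        theta += lr * G * g
    return theta
```

A PSR-based variant would replace `belief_update` with the PSR prediction-vector update and feed that vector to the same policy; the policy-gradient step itself is unchanged.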
Cite
Text
Aberdeen et al. "Policy-Gradients for PSRs and POMDPs." Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.

Markdown
[Aberdeen et al. "Policy-Gradients for PSRs and POMDPs." Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.](https://mlanthology.org/aistats/2007/aberdeen2007aistats-policygradients/)

BibTeX
@inproceedings{aberdeen2007aistats-policygradients,
title = {{Policy-Gradients for PSRs and POMDPs}},
author = {Aberdeen, Douglas and Buffet, Olivier and Thomas, Owen},
booktitle = {Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics},
year = {2007},
pages = {3-10},
volume = {2},
url = {https://mlanthology.org/aistats/2007/aberdeen2007aistats-policygradients/}
}