Policy-Gradients for PSRs and POMDPs

Abstract

In uncertain and partially observable environments, control policies must be a function of the complete history of actions and observations. Rather than present an ever-growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best-known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.
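
To illustrate the general idea of mapping a sufficient statistic to a control policy with policy-gradient learning, the following sketch tracks the belief state of a tiny, hypothetical two-state POMDP and feeds it to a softmax policy trained with a plain REINFORCE-style gradient. This is not the algorithm used in the paper; the model parameters, function names, and learning-rate choices are assumptions made purely for illustration.

# Illustrative sketch (not the authors' algorithm): REINFORCE-style policy
# gradient on top of a tracked belief state for a hypothetical two-state POMDP.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical POMDP: 2 hidden states, 2 actions, 2 observations.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],      # T[a, s, s']: transition probabilities
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[0.8, 0.2], [0.2, 0.8]])        # O[s', o]: observation probabilities
R = np.array([[1.0, 0.0], [0.0, 1.0]])        # R[s, a]: immediate rewards

def update_belief(b, a, o):
    """Bayes filter: the belief is a sufficient statistic of the history."""
    b_pred = b @ T[a]                 # predict next-state distribution
    b_new = b_pred * O[:, o]          # weight by observation likelihood
    return b_new / b_new.sum()

def policy(theta, b):
    """Softmax policy whose input is the belief state, not the raw history."""
    logits = theta @ b                # theta has shape (n_actions, n_states)
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta, alpha = np.zeros((2, 2)), 0.05
for episode in range(2000):
    s, b = rng.integers(2), np.array([0.5, 0.5])
    grads, rewards = [], []
    for t in range(20):
        p = policy(theta, b)
        a = rng.choice(2, p=p)
        # Score-function gradient of log pi(a | b): (one_hot(a) - p) outer b
        grads.append(np.outer(np.eye(2)[a] - p, b))
        rewards.append(R[s, a])
        s = rng.choice(2, p=T[a, s])  # hidden state transition
        o = rng.choice(2, p=O[s])     # emit observation
        b = update_belief(b, a, o)    # track the sufficient statistic
    G = sum(rewards)                  # undiscounted episode return
    theta += alpha * G * sum(grads)   # REINFORCE update

The same loop would apply if the belief update were replaced by a PSR state update; only the sufficient statistic fed to the policy changes, which is the substitution explored in the paper's experiments.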

Cite

Text

Aberdeen et al. "Policy-Gradients for PSRs and POMDPs." Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.

Markdown

[Aberdeen et al. "Policy-Gradients for PSRs and POMDPs." Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.](https://mlanthology.org/aistats/2007/aberdeen2007aistats-policygradients/)

BibTeX

@inproceedings{aberdeen2007aistats-policygradients,
  title     = {{Policy-Gradients for PSRs and POMDPs}},
  author    = {Aberdeen, Douglas and Buffet, Olivier and Thomas, Owen},
  booktitle = {Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics},
  year      = {2007},
  pages     = {3--10},
  volume    = {2},
  url       = {https://mlanthology.org/aistats/2007/aberdeen2007aistats-policygradients/}
}