Optimal Online Learning Procedures for Model-Free Policy Evaluation

Abstract

In this study, we extend the framework of semiparametric statistical inference, recently introduced to reinforcement learning [1], to online learning procedures for policy evaluation. This generalization enables us to investigate the statistical properties of value function estimators obtained by both batch and online procedures in a unified way, in terms of estimating functions. Furthermore, we propose a novel online learning algorithm with optimal estimating functions that achieve the minimum estimation error. Our theoretical developments are confirmed on a simple chain walk problem.
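For context, the chain walk benchmark mentioned in the abstract is a standard testbed for model-free policy evaluation. The sketch below is not the paper's estimating-function algorithm; it runs plain online TD(0), the usual baseline, on a hypothetical 5-state chain (reflecting ends, uniform-random policy, reward on reaching the rightmost state) and compares the online estimate against the exact value function obtained by solving the Bellman equation. All problem parameters (chain length, discount factor, step-size schedule) are illustrative assumptions.

```python
import numpy as np

# Hypothetical 5-state chain walk: a uniform-random policy moves left or
# right with equal probability; the ends of the chain are reflecting.
# Reward is +1 whenever the next state is the rightmost state, else 0.
n_states, gamma = 5, 0.9
rng = np.random.default_rng(0)

def step(s):
    s2 = min(s + 1, n_states - 1) if rng.random() < 0.5 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

# Online TD(0) policy evaluation:
#   V(s) <- V(s) + alpha_t * (r + gamma * V(s') - V(s))
V = np.zeros(n_states)
s = 0
for t in range(200_000):
    alpha = 0.5 / (1.0 + t / 1000.0)  # decaying (Robbins-Monro) step size
    s2, r = step(s)
    V[s] += alpha * (r + gamma * V[s2] - V[s])
    s = s2

# Exact solution of the Bellman equation V = R + gamma * P V for comparison.
P = np.zeros((n_states, n_states))
R = np.zeros(n_states)
for si in range(n_states):
    for s2i in (max(si - 1, 0), min(si + 1, n_states - 1)):
        P[si, s2i] += 0.5
        R[si] += 0.5 * (1.0 if s2i == n_states - 1 else 0.0)
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, R)

print("TD(0) estimate:", np.round(V, 3))
print("exact values:  ", np.round(V_true, 3))
```

The paper's contribution can be read as replacing the fixed TD update direction with an optimally weighted estimating function, reducing the asymptotic estimation error relative to schemes like the one above.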

Cite

Text

Ueno et al. "Optimal Online Learning Procedures for Model-Free Policy Evaluation." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009. doi:10.1007/978-3-642-04174-7_31

Markdown

[Ueno et al. "Optimal Online Learning Procedures for Model-Free Policy Evaluation." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009.](https://mlanthology.org/ecmlpkdd/2009/ueno2009ecmlpkdd-optimal/) doi:10.1007/978-3-642-04174-7_31

BibTeX

@inproceedings{ueno2009ecmlpkdd-optimal,
  title     = {{Optimal Online Learning Procedures for Model-Free Policy Evaluation}},
  author    = {Ueno, Tsuyoshi and Maeda, Shin-ichi and Kawanabe, Motoaki and Ishii, Shin},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2009},
  pages     = {473--488},
  doi       = {10.1007/978-3-642-04174-7_31},
  url       = {https://mlanthology.org/ecmlpkdd/2009/ueno2009ecmlpkdd-optimal/}
}