High-Confidence Off-Policy Evaluation

Abstract

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
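To make the setting concrete, below is a minimal sketch of ordinary importance-sampling off-policy evaluation combined with a Hoeffding-style lower confidence bound. This is only an illustration of the general idea described in the abstract, not the concentration inequality or estimator developed in the paper; the function names, the trajectory format, and the assumption of a known upper bound `b` on the weighted returns are all hypothetical choices made for this example.

```python
import numpy as np

def importance_sampled_returns(trajectories, pi_e, pi_b):
    """Per-trajectory importance-sampled return estimates for an evaluation
    policy pi_e, using trajectories collected under a behavior policy pi_b.
    Each trajectory is assumed to be a list of (state, action, reward) tuples,
    and pi_e(a, s) / pi_b(a, s) are the action probabilities under each policy."""
    estimates = []
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for (s, a, r) in traj:
            weight *= pi_e(a, s) / pi_b(a, s)  # likelihood ratio of the trajectory so far
            ret += r
        estimates.append(weight * ret)
    return np.array(estimates)

def hoeffding_lower_bound(x, b, delta=0.05):
    """1 - delta lower confidence bound on the mean of x, assuming each
    element of x lies in [0, b] (Hoeffding's inequality)."""
    n = len(x)
    return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Usage sketch: bound the expected return of pi_e with 95% confidence.
# x = importance_sampled_returns(trajectories, pi_e, pi_b)
# lower = hoeffding_lower_bound(x, b=max_weighted_return, delta=0.05)
```

In practice the importance-weighted returns can have a very large range and heavy tails, so a naive Hoeffding bound like the one above tends to be extremely loose; addressing exactly this looseness is part of what motivates the method proposed in the paper.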

Cite

Text

Thomas et al. "High-Confidence Off-Policy Evaluation." AAAI Conference on Artificial Intelligence, 2015. doi:10.1609/AAAI.V29I1.9541

Markdown

[Thomas et al. "High-Confidence Off-Policy Evaluation." AAAI Conference on Artificial Intelligence, 2015.](https://mlanthology.org/aaai/2015/thomas2015aaai-high/) doi:10.1609/AAAI.V29I1.9541

BibTeX

@inproceedings{thomas2015aaai-high,
  title     = {{High-Confidence Off-Policy Evaluation}},
  author    = {Thomas, Philip S. and Theocharous, Georgios and Ghavamzadeh, Mohammad},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2015},
  pages     = {3000--3006},
  doi       = {10.1609/AAAI.V29I1.9541},
  url       = {https://mlanthology.org/aaai/2015/thomas2015aaai-high/}
}