Off-Policy Evaluation in Partially Observable Environments

Abstract

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
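For readers unfamiliar with the baseline the abstract contrasts against, below is a minimal sketch of trajectory-wise Importance Sampling (IS) for off-policy evaluation. It is not the paper's Decoupled-POMDP estimator; the policy and trajectory interfaces here are illustrative assumptions only.

```python
import numpy as np

def is_estimate(trajectories, behavior_policy, target_policy, gamma=1.0):
    """Trajectory-wise importance sampling estimate of the target policy's value.

    trajectories: list of episodes, each a list of (obs, action, reward) tuples
    behavior_policy(obs, action): probability of `action` under the logging policy
    target_policy(obs, action):   probability of `action` under the evaluated policy
    """
    estimates = []
    for episode in trajectories:
        weight, ret = 1.0, 0.0
        for t, (obs, action, reward) in enumerate(episode):
            # Cumulative likelihood ratio of the actions taken in this episode.
            weight *= target_policy(obs, action) / behavior_policy(obs, action)
            ret += (gamma ** t) * reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy usage with observation-independent two-action policies (illustrative only).
behavior = lambda obs, a: 0.5
target = lambda obs, a: 0.9 if a == 1 else 0.1
data = [[(0, 1, 1.0), (0, 0, 0.5)], [(0, 1, 0.0)]]
print(is_estimate(data, behavior, target))
```

Note that the importance weights here condition only on observations, not on the latent state; under partial observability this is exactly where the bias the abstract warns about can enter.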

Cite

Text

Tennenholtz et al. "Off-Policy Evaluation in Partially Observable Environments." AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/AAAI.V34I06.6590

Markdown

[Tennenholtz et al. "Off-Policy Evaluation in Partially Observable Environments." AAAI Conference on Artificial Intelligence, 2020.](https://mlanthology.org/aaai/2020/tennenholtz2020aaai-off/) doi:10.1609/AAAI.V34I06.6590

BibTeX

@inproceedings{tennenholtz2020aaai-off,
  title     = {{Off-Policy Evaluation in Partially Observable Environments}},
  author    = {Tennenholtz, Guy and Shalit, Uri and Mannor, Shie},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2020},
  pages     = {10276--10283},
  doi       = {10.1609/AAAI.V34I06.6590},
  url       = {https://mlanthology.org/aaai/2020/tennenholtz2020aaai-off/}
}