An Instrumental Variable Approach to Confounded Off-Policy Evaluation

Abstract

Off-policy evaluation (OPE) aims to estimate the return of a target policy from pre-collected observational data generated by a potentially different behavior policy. In many cases, unmeasured variables confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded sequential decision making. As in single-stage decision making, we show that an IV enables correct identification of the target policy's value in infinite-horizon settings. Furthermore, we propose a number of policy value estimators and illustrate their effectiveness through extensive simulations and an analysis of real data from a world-leading short-video platform.
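As a rough illustration of the single-stage idea the abstract builds on (not the paper's sequential, infinite-horizon estimators), the sketch below simulates data with a hypothetical unmeasured confounder `U` and a binary instrument `Z`, then compares a naive estimate of the value of the policy "always take action 1" with a Wald-type IV estimate. It assumes a linear reward model with a constant action effect, plus the usual IV conditions: `Z` shifts the action, affects the reward only through the action, and is independent of `U`. The functions `simulate`, `naive_value`, and `iv_value` and the data-generating model are hypothetical, chosen only for this example.

```python
import numpy as np


def simulate(n, rng):
    """Toy single-stage data with an unmeasured confounder U (hypothetical model)."""
    z = rng.binomial(1, 0.5, size=n)                        # instrument: shifts A, no direct effect on R
    u = rng.normal(size=n)                                  # unmeasured confounder, independent of Z
    a = (z + u + rng.normal(size=n) > 0.5).astype(float)    # behavior action depends on both Z and U
    r = 2.0 * a + u + rng.normal(size=n)                    # true causal effect of A on R is 2.0
    return z, a, r


def naive_value(a, r, target_action):
    """Mean reward among units that happened to take the target action (confounded by U)."""
    return r[a == target_action].mean()


def iv_value(z, a, r, target_action):
    """Wald-type IV estimate of E[R(target_action)], assuming E[R(a)] = b0 + beta * a."""
    # Reduced-form effect of Z on R divided by first-stage effect of Z on A.
    beta = (r[z == 1].mean() - r[z == 0].mean()) / (a[z == 1].mean() - a[z == 0].mean())
    # Intercept from E[R] = b0 + beta * E[A], valid because the error term has mean zero.
    b0 = r.mean() - beta * a.mean()
    return b0 + beta * target_action


rng = np.random.default_rng(0)
z, a, r = simulate(200_000, rng)
# By construction the true value of "always take action 1" is 2.0.
print("naive value of always-A=1:", round(naive_value(a, r, 1), 3))
print("IV value of always-A=1:   ", round(iv_value(z, a, r, 1), 3))
```

Because units with a large `U` are both more likely to take action 1 and more likely to receive high rewards, the naive average is biased upward, while the IV estimate uses only the variation in actions induced by `Z` and recovers the true value under the stated assumptions.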

Cite

Text

Xu et al. "An Instrumental Variable Approach to Confounded Off-Policy Evaluation." International Conference on Machine Learning, 2023.

BibTeX

@inproceedings{xu2023icml-instrumental,
  title     = {{An Instrumental Variable Approach to Confounded Off-Policy Evaluation}},
  author    = {Xu, Yang and Zhu, Jin and Shi, Chengchun and Luo, Shikai and Song, Rui},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {38848-38880},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/xu2023icml-instrumental/}
}