A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Abstract

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e., where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and the rewards are arbitrary, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
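To make the abstract's maximum-entropy idea concrete, below is a minimal sketch (not the authors' code or their exact estimator) for a small tabular MDP: it finds the maximum-entropy state distribution whose feature expectations are stationary under the target-policy dynamics, which yields an exponential-family solution with the feature "drift" as sufficient statistics. All names here (P_pi, phi, delta, etc.) are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n_states, n_features = 6, 3

# Hypothetical target-policy transition matrix and state features.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
phi = rng.random((n_states, n_features))

# Feature drift under the dynamics: delta(s) = E[phi(s') | s, pi] - phi(s).
# Stationarity of feature expectations means E_d[delta] = 0.
delta = P_pi @ phi - phi

def dual(theta):
    # Log-partition of the exponential-family solution d_theta(s) ∝ exp(theta^T delta(s));
    # minimizing it drives the constraint gradient E_{d_theta}[delta] to zero.
    return logsumexp(delta @ theta)

res = minimize(dual, np.zeros(n_features), method="BFGS")
logits = delta @ res.x
d_hat = np.exp(logits - logits.max())
d_hat /= d_hat.sum()

# Reference: true stationary distribution of P_pi (leading left eigenvector).
# With few features, d_hat only matches it in the feature-drift sense, not exactly.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_true = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
d_true /= d_true.sum()
print("max-entropy estimate:", np.round(d_hat, 3))
print("true stationary dist:", np.round(d_true, 3))

In an OPE context, an estimate like d_hat would be used to reweight observed rewards when estimating the target policy's average reward; the paper additionally handles empirical (rather than known) dynamics and function approximation.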

Cite

Text

Lazic et al. "A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs." Neural Information Processing Systems, 2020.

Markdown

[Lazic et al. "A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/lazic2020neurips-maximumentropy/)

BibTeX

@inproceedings{lazic2020neurips-maximumentropy,
  title     = {{A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs}},
  author    = {Lazic, Nevena and Yin, Dong and Farajtabar, Mehrdad and Levine, Nir and Gorur, Dilan and Harris, Chris and Schuurmans, Dale},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/lazic2020neurips-maximumentropy/}
}