Variance-Aware Off-Policy Evaluation with Linear Function Approximation
Abstract
We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, \texttt{VA-OPE}, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
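To make the reweighting idea in the abstract concrete, below is a minimal NumPy sketch of variance-weighted ridge regression inside backward Fitted Q-Iteration for a horizon-H, time-inhomogeneous linear MDP. This is an illustrative sketch, not the paper's exact VA-OPE construction: the function names (variance_weighted_ridge, variance_aware_fqi), the clipping constant, and the pluggable variance_fn hook are assumptions, and the paper's own variance estimator (built from the same offline data) is omitted here.

```python
import numpy as np

def variance_weighted_ridge(Phi, y, sigma2, lam=1.0):
    """Weighted ridge regression: each squared Bellman residual is scaled by
    1 / sigma2[i], so high-variance transitions contribute less to the fit.

    Phi    : (n, d) array of features phi(s_i, a_i)
    y      : (n,)   regression targets r_i + V_{h+1}(s'_i)
    sigma2 : (n,)   estimated conditional variances (clipped away from zero;
                    the clipping constant 1e-2 is an illustrative choice)
    """
    inv_var = 1.0 / np.clip(sigma2, 1e-2, None)
    A = Phi.T @ (inv_var[:, None] * Phi) + lam * np.eye(Phi.shape[1])
    b = Phi.T @ (inv_var * y)
    return np.linalg.solve(A, b)


def variance_aware_fqi(data, phi, pi_target, H, d, lam=1.0, variance_fn=None):
    """Backward fitted Q-iteration over a horizon-H, time-inhomogeneous MDP,
    reweighting the Bellman residual at every step h.

    data[h]     : list of (s, a, r, s_next) transitions from the behavior policy
    phi         : feature map, phi(s, a) -> np.ndarray of shape (d,)
    pi_target   : deterministic target policy, pi_target(h, s) -> action
    variance_fn : optional estimator of Var[r + V_{h+1}(s') | s, a];
                  defaults to all-ones weights, i.e. plain least squares.
    """
    w = [np.zeros(d) for _ in range(H + 1)]  # w[H] = 0 at the terminal step
    for h in reversed(range(H)):
        Phi = np.array([phi(s, a) for (s, a, _, _) in data[h]])
        # Regression target: reward plus next-step value under the target policy,
        # V_{h+1}(s') = phi(s', pi(h+1, s'))^T w_{h+1}.
        y = np.array([r + phi(s2, pi_target(h + 1, s2)) @ w[h + 1]
                      for (_, _, r, s2) in data[h]])
        sigma2 = (variance_fn(h, data[h], w[h + 1]) if variance_fn is not None
                  else np.ones(len(y)))
        w[h] = variance_weighted_ridge(Phi, y, sigma2, lam)
    return w  # value estimate: V_h(s) ~ phi(s, pi_target(h, s)) @ w[h]
```

Intuitively, dividing each residual by its estimated variance lets low-noise transitions dominate the least-squares fit, which is the mechanism the abstract points to for improving the sample efficiency of OPE.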
Cite
Text
Min et al. "Variance-Aware Off-Policy Evaluation with Linear Function Approximation." Neural Information Processing Systems, 2021.
Markdown
[Min et al. "Variance-Aware Off-Policy Evaluation with Linear Function Approximation." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/min2021neurips-varianceaware/)
BibTeX
@inproceedings{min2021neurips-varianceaware,
  title     = {{Variance-Aware Off-Policy Evaluation with Linear Function Approximation}},
  author    = {Min, Yifei and Wang, Tianhao and Zhou, Dongruo and Gu, Quanquan},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/min2021neurips-varianceaware/}
}