Simultaneous Statistical Inference for Off-Policy Evaluation in Reinforcement Learning

Abstract

This work presents the first theoretically justified simultaneous inference framework for off-policy evaluation (OPE). In contrast to existing methods that focus on point estimates or pointwise confidence intervals (CIs), the new framework quantifies global uncertainty across an infinite or continuous initial state space, offering valid inference over the entire state space. Our method leverages sieve-based Q-function estimation and (high-dimensional) Gaussian approximation techniques over convex regions, which further motivates a new multiplier bootstrap algorithm for constructing asymptotically correct simultaneous confidence regions (SCRs). The widths of the SCRs exceed those of the pointwise CIs by only a logarithmic factor, indicating that our procedure is nearly optimal in terms of efficiency. The effectiveness of the proposed approach is demonstrated through simulations and analysis of the OhioT1DM dataset.
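The abstract describes the pipeline only at a high level (sieve-based Q-function estimation, Gaussian approximation, then a multiplier bootstrap for the simultaneous confidence region). As a rough, hypothetical sketch of how the multiplier-bootstrap step might look in a generic sieve (basis-expansion) setting, consider the snippet below. It is not the paper's algorithm: the function name `multiplier_bootstrap_scr` and the inputs `phi_train`, `residuals`, and `phi_grid` are illustrative assumptions, and the sandwich-covariance and studentization details are a standard textbook construction rather than the authors' procedure.

```python
import numpy as np

def multiplier_bootstrap_scr(phi_train, residuals, phi_grid,
                             n_boot=1000, alpha=0.05, rng=None):
    """Generic multiplier-bootstrap critical value for a simultaneous band.

    phi_train : (n, K) sieve basis evaluated at the sampled initial states
    residuals : (n,)   residuals from the sieve-based value/Q-function fit
    phi_grid  : (m, K) basis evaluated on a grid of initial states where
                the band is reported

    Returns (c_alpha, se_grid); the band is estimate(s) +/- c_alpha * se_grid(s).
    """
    rng = np.random.default_rng(rng)
    n, _ = phi_train.shape

    # Sandwich-type covariance of the sieve coefficient estimator.
    gram = phi_train.T @ phi_train / n
    gram_inv = np.linalg.inv(gram)
    scores = phi_train * residuals[:, None]        # per-sample influence terms
    meat = scores.T @ scores / n
    cov = gram_inv @ meat @ gram_inv / n           # approximate Var(beta_hat)

    # Pointwise standard errors on the evaluation grid.
    se_grid = np.sqrt(np.einsum("ij,jk,ik->i", phi_grid, cov, phi_grid))

    # Multiplier bootstrap: perturb the influence terms with i.i.d. N(0, 1)
    # weights and track the sup of the studentized process over the grid.
    proj = phi_grid @ gram_inv / n                 # maps summed scores to grid deviations
    sup_stats = np.empty(n_boot)
    for b in range(n_boot):
        xi = rng.standard_normal(n)
        boot_dev = proj @ (scores.T @ xi)
        sup_stats[b] = np.max(np.abs(boot_dev) / se_grid)

    c_alpha = np.quantile(sup_stats, 1 - alpha)
    return c_alpha, se_grid
```

In this illustrative setup, the simultaneous region at level 1 - alpha is obtained by inflating the pointwise band by the bootstrap sup-quantile `c_alpha`; the logarithmic gap between simultaneous and pointwise widths mentioned in the abstract corresponds to `c_alpha` growing only like the square root of the log of the effective number of basis functions, rather than polynomially.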

Cite

Text

Luo et al. "Simultaneous Statistical Inference for Off-Policy Evaluation in Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Luo et al. "Simultaneous Statistical Inference for Off-Policy Evaluation in Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/luo2025neurips-simultaneous/)

BibTeX

@inproceedings{luo2025neurips-simultaneous,
  title     = {{Simultaneous Statistical Inference for Off-Policy Evaluation in Reinforcement Learning}},
  author    = {Luo, Tianpai and Fan, Xinyuan and Wu, Weichi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/luo2025neurips-simultaneous/}
}