Non-Stationary Off-Policy Optimization

Abstract

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary, and policies learned offline should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on the observed context.
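To make the two-phase structure concrete, here is a minimal sketch of the idea, not the authors' algorithm: logged data is partitioned into latent states (here by a placeholder k-means on contexts), a simple IPS-greedy sub-policy is learned per state offline, and an EXP3-style rule switches between sub-policies online. All data, state counts, and the switching rule are illustrative assumptions.

```python
# Minimal sketch of offline sub-policy learning + online adaptive switching.
# Assumptions (not from the paper): logged tuples (context, action, reward,
# propensity), k-means as a stand-in for latent-state partitioning, greedy
# per-state sub-policies, and EXP3-style switching during deployment.
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_states = 5, 3

# --- Synthetic logged data from a uniform logging policy (illustrative) -----
contexts = rng.normal(size=(3000, 4))
actions = rng.integers(n_actions, size=3000)
propensities = np.full(3000, 1.0 / n_actions)
rewards = rng.binomial(1, 0.3 + 0.1 * (actions % 2))  # toy reward model

# --- Offline phase: partition logged data, learn one sub-policy per state ---
def kmeans(X, k, iters=20):
    """Crude k-means; a placeholder for the paper's latent-state partition."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

states = kmeans(contexts, n_states)

sub_policies = []  # one greedy action per latent state (a simplification)
for s in range(n_states):
    mask = states == s
    ips = np.zeros(n_actions)
    for a in range(n_actions):
        hit = mask & (actions == a)
        # Inverse-propensity-scored value estimate of action a in state s.
        ips[a] = np.sum(rewards[hit] / propensities[hit]) / max(mask.sum(), 1)
    sub_policies.append(int(np.argmax(ips)))

# --- Online phase: EXP3-style adaptive switching between sub-policies -------
weights = np.ones(n_states)
gamma = 0.1
for t in range(1000):
    probs = (1 - gamma) * weights / weights.sum() + gamma / n_states
    probs /= probs.sum()
    k = rng.choice(n_states, p=probs)
    action = sub_policies[k]
    reward = rng.binomial(1, 0.3 + 0.1 * (action % 2))  # stand-in environment
    # Importance-weighted exponential update of the chosen sub-policy.
    weights[k] *= np.exp(gamma * (reward / probs[k]) / n_states)

print("learned sub-policies (greedy actions per state):", sub_policies)
print("final switching distribution:", np.round(weights / weights.sum(), 3))
```

The sketch only mirrors the shape of the approach; the paper's latent-state model, sub-policy optimization, and switching rule with its regret guarantees differ from these placeholders.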

Cite

Text

Hong et al. "Non-Stationary Off-Policy Optimization." Artificial Intelligence and Statistics, 2021.

Markdown

[Hong et al. "Non-Stationary Off-Policy Optimization." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/hong2021aistats-nonstationary/)

BibTeX

@inproceedings{hong2021aistats-nonstationary,
  title     = {{Non-Stationary Off-Policy Optimization}},
  author    = {Hong, Joey and Kveton, Branislav and Zaheer, Manzil and Chow, Yinlam and Ahmed, Amr},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2021},
  pages     = {2494--2502},
  volume    = {130},
  url       = {https://mlanthology.org/aistats/2021/hong2021aistats-nonstationary/}
}