Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

Abstract

We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
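The abstract describes an off-policy selection method built on a betting-based confidence bound applied to an inverse propensity weight (IPW) sequence. As a rough illustration only, and not the paper's construction, the sketch below shows the generic betting idea: rescale the IPW-weighted rewards into [0, 1] using a known weight bound, run a simple capital (test-martingale) process for each candidate mean, and take the smallest candidate that is not rejected as a lower confidence bound on the policy value. The function names (`betting_lcb`, `ipw_policy_lcb`), the fixed bet size `lam`, the candidate grid, and the known bound `w_max` are all illustrative assumptions.

```python
import numpy as np

def betting_lcb(z, delta=0.05, lam=0.5, grid_size=200):
    """Lower confidence bound on the mean of i.i.d. observations z in [0, 1],
    via a simple betting (test-martingale) construction.

    For each candidate mean m, the capital K_T(m) = prod_t (1 + lam_m * (z_t - m))
    is a nonnegative martingale with K_0 = 1 under H0: E[z] = m, so by Ville's
    inequality, rejecting when K_T(m) >= 1/delta controls the error at level delta.
    Returns the smallest grid point that is not rejected (a conservative LCB).
    """
    z = np.asarray(z, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)
    for m in grid:
        # Cap the bet so that 1 + lam_m * (z - m) stays strictly positive for z in [0, 1].
        lam_m = min(lam, 0.9 / max(m, 1e-12))
        log_capital = np.sum(np.log1p(lam_m * (z - m)))
        if log_capital < np.log(1.0 / delta):
            return m          # first candidate mean that is not rejected
    return 0.0                # fall back to the trivial lower bound

def ipw_policy_lcb(rewards, target_probs, behavior_probs, w_max, delta=0.05):
    """Betting-style lower bound on the value of a target policy from logged
    bandit data: rewards assumed in [0, 1], and the importance weights
    pi(a|x) / mu(a|x) assumed bounded by the known constant w_max
    (a hypothetical input for this sketch).
    """
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    z = w * np.asarray(rewards, dtype=float) / w_max   # rescale IPW terms into [0, 1]
    return w_max * betting_lcb(z, delta=delta)         # E[w * r] is the policy value
```

Given logged rewards and the target/behavior action probabilities for each logged action, `ipw_policy_lcb` returns a high-probability lower bound on the target policy's value; ranking candidate policies by such lower bounds is the usual pessimistic off-policy selection rule, which the paper's variance-adaptive bound is designed to tighten.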

Cite

Text

Ryu et al. "Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing." Proceedings of Thirty Eighth Conference on Learning Theory, 2025.

Markdown

[Ryu et al. "Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing." Proceedings of Thirty Eighth Conference on Learning Theory, 2025.](https://mlanthology.org/colt/2025/ryu2025colt-improved/)

BibTeX

@inproceedings{ryu2025colt-improved,
  title     = {{Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing}},
  author    = {Ryu, J. Jon and Kwon, Jeongyeol and Koppe, Benjamin and Jun, Kwang-Sung},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  year      = {2025},
  pages     = {5015--5053},
  volume    = {291},
  url       = {https://mlanthology.org/colt/2025/ryu2025colt-improved/}
}