Instance-Optimal PAC Algorithms for Contextual Bandits

Abstract

In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(\epsilon,\delta)$-PAC setting: given a policy class $\Pi$ the goal of the learner is to return a policy $\pi\in \Pi$ whose expected reward is within $\epsilon$ of the optimal policy with probability greater than $1-\delta$. We characterize the first instance-dependent PAC sample complexity of contextual bandits through a quantity $\rho_{\Pi}$, and provide matching upper and lower bounds in terms of $\rho_{\Pi}$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to a cost-sensitive classification oracle.

Cite

Text

Li et al. "Instance-Optimal PAC Algorithms for Contextual Bandits." Neural Information Processing Systems, 2022.

Markdown

[Li et al. "Instance-Optimal PAC Algorithms for Contextual Bandits." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/li2022neurips-instanceoptimal/)

BibTeX

@inproceedings{li2022neurips-instanceoptimal,
  title     = {{Instance-Optimal PAC Algorithms for Contextual Bandits}},
  author    = {Li, Zhaoqi and Ratliff, Lillian and Nassif, Houssam and Jamieson, Kevin G. and Jain, Lalit},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/li2022neurips-instanceoptimal/}
}