Minimax Regret for Cascading Bandits

Abstract

Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.

Cite

Text

Vial et al. "Minimax Regret for Cascading Bandits." Neural Information Processing Systems, 2022.

Markdown

[Vial et al. "Minimax Regret for Cascading Bandits." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/vial2022neurips-minimax/)

BibTeX

@inproceedings{vial2022neurips-minimax,
  title     = {{Minimax Regret for Cascading Bandits}},
  author    = {Vial, Daniel and Sanghavi, Sujay and Shakkottai, Sanjay and Srikant, R.},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/vial2022neurips-minimax/}
}