Top K Ranking for Multi-Armed Bandit with Noisy Evaluations

Abstract

We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm, and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show an $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are a generalized linear function of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
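
Below is a minimal Python sketch of the interaction protocol described in the abstract, not the authors' algorithm or any specific method from the paper. The arm count, the Gaussian reward distributions, and the noisy linear observation model (slope a, bias b, noise level sigma) are illustrative assumptions; the learner here simply picks the K arms with the highest evaluations each round.

import numpy as np

rng = np.random.default_rng(0)
n_arms, K, T = 10, 3, 1000
a, b, sigma = 1.0, 0.5, 0.1           # assumed slope, bias, and evaluation noise
mu = rng.normal(0.0, 1.0, n_arms)     # mean true reward of each arm (fixed distribution)

total_reward = 0.0
for t in range(T):
    rewards = mu + rng.normal(0.0, 0.1, n_arms)                # true rewards drawn this round
    evals = a * rewards + b + rng.normal(0.0, sigma, n_arms)   # noisy, biased evaluations shown to the learner
    chosen = np.argsort(evals)[-K:]                            # select the K arms with the largest evaluations
    total_reward += rewards[chosen].sum()                      # accumulate the true (unobserved) reward

print(f"average per-round reward over T={T}: {total_reward / T:.3f}")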

Cite

Text

Garcelon et al. "Top K Ranking for Multi-Armed Bandit with Noisy Evaluations." Artificial Intelligence and Statistics, 2022.

Markdown

[Garcelon et al. "Top K Ranking for Multi-Armed Bandit with Noisy Evaluations." Artificial Intelligence and Statistics, 2022.](https://mlanthology.org/aistats/2022/garcelon2022aistats-top/)

BibTeX

@inproceedings{garcelon2022aistats-top,
  title     = {{Top K Ranking for Multi-Armed Bandit with Noisy Evaluations}},
  author    = {Garcelon, Evrard and Avadhanula, Vashist and Lazaric, Alessandro and Pirotta, Matteo},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2022},
  pages     = {6242--6269},
  volume    = {151},
  url       = {https://mlanthology.org/aistats/2022/garcelon2022aistats-top/}
}