Adversarial Multi-Dueling Bandits

Abstract

We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner must select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, based on an arbitrary preference matrix, possibly chosen adversarially. We introduce a novel algorithm, MIDEX, to learn from such preference feedback, which is assumed to be generated by a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MIDEX compared to a Borda winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ on the expected regret in this setting, which demonstrates that our proposed algorithm is near-optimal.
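The feedback protocol in the abstract can be sketched in a few lines. The helper below, `pairwise_subset_winner`, is a hypothetical illustration, not code from the paper; it uses one common reading of a pairwise-subset choice model, in which the winner of a selected subset is distributed as the winner of a duel between a uniformly random pair drawn from that subset, with duels governed by a pairwise preference matrix $P$ where $P[i][j] + P[j][i] = 1$.

```python
import random

def pairwise_subset_winner(P, S, rng=random):
    """Simulate one round of multi-dueling feedback.

    Under the assumed pairwise-subset choice model, the most preferred
    arm of the subset S is generated by drawing a uniformly random pair
    (i, j) from S and reporting the winner of their duel, where i beats
    j with probability P[i][j].
    """
    i, j = rng.sample(S, 2)
    return i if rng.random() < P[i][j] else j

# Illustrative preference matrix for K = 3 arms (P[i][j] + P[j][i] = 1).
# In the adversarial setting this matrix may change arbitrarily per round.
P = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]
S = [0, 1, 2]  # the m arms the learner selected this round
winner = pairwise_subset_winner(P, S)
```

Under this matrix, arm 0 has the highest Borda score (its average win probability against the other arms), so over many rounds it is reported as the winner most often; the learner's regret is then measured against always playing such a Borda winner.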

Cite

Text

Gajane. "Adversarial Multi-Dueling Bandits." ICML 2024 Workshops: MFHAIA, 2024.

Markdown

[Gajane. "Adversarial Multi-Dueling Bandits." ICML 2024 Workshops: MFHAIA, 2024.](https://mlanthology.org/icmlw/2024/gajane2024icmlw-adversarial/)

BibTeX

@inproceedings{gajane2024icmlw-adversarial,
  title     = {{Adversarial Multi-Dueling Bandits}},
  author    = {Gajane, Pratik},
  booktitle = {ICML 2024 Workshops: MFHAIA},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/gajane2024icmlw-adversarial/}
}