Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm

Abstract

We study the K-armed dueling bandit problem, a variation of the standard stochastic bandit problem where the feedback is limited to relative comparisons of a pair of arms. The hardness of recommending Copeland winners, the arms that beat the greatest number of other arms, is characterized by deriving an asymptotic regret bound. We propose Copeland Winners Relative Minimum Empirical Divergence (CW-RMED), an algorithm inspired by the DMED algorithm (Honda and Takemura, 2010), and derive an asymptotically optimal regret bound for it. However, it is not known whether CW-RMED can be computed efficiently. To address this issue, we devise an efficient version (ECW-RMED) and derive its asymptotic regret bound. Experimental comparisons of dueling bandit algorithms show that ECW-RMED significantly outperforms existing ones.
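The notion of a Copeland winner used in the abstract can be illustrated with a short sketch: given a pairwise preference matrix, the Copeland winners are the arms that beat the most other arms. This is only an illustration of the definition, not the paper's algorithm; the function name and the example matrix are made up here.

```python
def copeland_winners(pref):
    """Return indices of Copeland winners for a K x K preference matrix.

    pref[i][j] is the probability that arm i beats arm j in a duel.
    Arm i beats arm j when pref[i][j] > 0.5; a Copeland winner is any
    arm that beats the greatest number of other arms.
    """
    K = len(pref)
    wins = [sum(1 for j in range(K) if j != i and pref[i][j] > 0.5)
            for i in range(K)]
    best = max(wins)
    return [i for i in range(K) if wins[i] == best]

# Illustrative 3-armed instance: arm 0 beats arms 1 and 2, arm 1 beats arm 2.
pref = [[0.5, 0.6, 0.7],
        [0.4, 0.5, 0.8],
        [0.3, 0.2, 0.5]]
print(copeland_winners(pref))  # -> [0]
```

Note that, unlike a Condorcet winner (an arm that beats every other arm), a Copeland winner always exists, though it need not be unique.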

Cite

Text

Komiyama et al. "Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm." International Conference on Machine Learning, 2016.

Markdown

[Komiyama et al. "Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm." International Conference on Machine Learning, 2016.](https://mlanthology.org/icml/2016/komiyama2016icml-copeland/)

BibTeX

@inproceedings{komiyama2016icml-copeland,
  title     = {{Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm}},
  author    = {Komiyama, Junpei and Honda, Junya and Nakagawa, Hiroshi},
  booktitle = {International Conference on Machine Learning},
  year      = {2016},
  pages     = {1235-1244},
  volume    = {48},
  url       = {https://mlanthology.org/icml/2016/komiyama2016icml-copeland/}
}