Neural Dueling Bandits

Abstract

The contextual dueling bandit framework models bandit problems in which a learner's goal is to find the best arm for a given context using noisy preference feedback observed over the arms selected for past contexts. However, existing algorithms assume the reward function is linear, whereas it can be complex and non-linear in many real-life applications, such as online recommendations or ranking web search results. To overcome this challenge, we use a neural network to estimate the reward function from the preference feedback for the previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution. Experimental results on problem instances derived from synthetic datasets corroborate our theoretical results.
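To make the idea concrete, here is a minimal, self-contained sketch of the kind of loop the abstract describes: a small neural network fits a reward function from pairwise preference feedback via a Bradley-Terry (logistic) loss, and arms are chosen optimistically. This is an illustrative toy, not the paper's algorithm: the network architecture, the count-based bonus standing in for the paper's confidence bounds, and all hyperparameters are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyRewardNet:
    """One-hidden-layer network trained on pairwise preferences (illustrative)."""

    def __init__(self, dim, hidden=16, lr=0.1):
        self.W1 = rng.normal(0, 1 / np.sqrt(dim), (hidden, dim))
        self.w2 = rng.normal(0, 1 / np.sqrt(hidden), hidden)
        self.lr = lr

    def score(self, x):
        # Estimated reward f(x) for one arm's context-feature vector.
        return self.w2 @ np.tanh(self.W1 @ x)

    def update(self, x_win, x_lose):
        # One gradient step on the Bradley-Terry loss
        # -log sigmoid(f(x_win) - f(x_lose)).
        h_w, h_l = np.tanh(self.W1 @ x_win), np.tanh(self.W1 @ x_lose)
        p = sigmoid(self.w2 @ h_w - self.w2 @ h_l)
        g = -(1.0 - p)  # d(loss)/d(score difference)
        self.w2 -= self.lr * g * (h_w - h_l)
        self.W1 -= self.lr * g * (
            np.outer(self.w2 * (1 - h_w**2), x_win)
            - np.outer(self.w2 * (1 - h_l**2), x_lose)
        )

# Simulated environment (assumed): latent linear utility, Bernoulli preferences.
dim, n_arms, T = 5, 8, 400
theta = rng.normal(size=dim)          # hidden true utility vector
arms = rng.normal(size=(n_arms, dim)) # fixed arm features for simplicity
net = TinyRewardNet(dim)
counts = np.ones(n_arms)

for t in range(1, T + 1):
    scores = np.array([net.score(a) for a in arms])
    bonus = np.sqrt(2 * np.log(t) / counts)  # UCB-style exploration bonus
    i = int(np.argmax(scores + bonus))
    rest = [j for j in range(n_arms) if j != i]
    j = rest[int(np.argmax((scores + bonus)[rest]))]
    # Noisy preference feedback from the latent Bradley-Terry model.
    p_i = sigmoid(arms[i] @ theta - arms[j] @ theta)
    if rng.random() < p_i:
        net.update(arms[i], arms[j])
    else:
        net.update(arms[j], arms[i])
    counts[i] += 1
    counts[j] += 1

best = int(np.argmax(arms @ theta))
learned = int(np.argmax([net.score(a) for a in arms]))
print("true best arm:", best, "| learned best:", learned)
```

Note that only preference outcomes, never scalar rewards, reach the learner; the network recovers a reward ordering purely from which arm won each duel. A Thompson sampling variant would replace the additive bonus with scores drawn from a posterior over the network's output.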

Cite

Text

Verma et al. "Neural Dueling Bandits." ICML 2024 Workshops: RLControlTheory, 2024.

Markdown

[Verma et al. "Neural Dueling Bandits." ICML 2024 Workshops: RLControlTheory, 2024.](https://mlanthology.org/icmlw/2024/verma2024icmlw-neural/)

BibTeX

@inproceedings{verma2024icmlw-neural,
  title     = {{Neural Dueling Bandits}},
  author    = {Verma, Arun and Dai, Zhongxiang and Lin, Xiaoqiang and Jaillet, Patrick and Low, Bryan Kian Hsiang},
  booktitle = {ICML 2024 Workshops: RLControlTheory},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/verma2024icmlw-neural/}
}