On Kernelized Multi-Armed Bandits

Abstract

We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization – Improved GP-UCB (IGP-UCB) and GP-Thompson sampling (GP-TS) – and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to the Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector-valued martingales of arbitrary, possibly infinite, dimension. Finally, we carry out experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments, highlighting the favourable gains of the proposed strategies in many cases.
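The abstract describes the two selection rules only at a high level. Below is a minimal, self-contained sketch of what GP-based UCB and Thompson-sampling arm selection look like on a discretized arm set, assuming an RBF kernel and placeholder hyperparameters (lengthscale, noise level, and a fixed exploration constant beta). Note that in the paper the IGP-UCB confidence width is a specific time-dependent quantity derived from the RKHS norm bound, the noise scale, and the maximum information gain; the constant beta here is purely illustrative, and the toy reward function is hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    d = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum(d ** 2, axis=-1) / (2 * lengthscale ** 2))

def gp_posterior(X, y, X_cand, noise=0.1, lengthscale=0.2):
    """Posterior mean and covariance of a zero-mean GP at the candidate arms."""
    K = rbf_kernel(X, X, lengthscale) + noise ** 2 * np.eye(len(X))
    K_s = rbf_kernel(X_cand, X, lengthscale)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y
    cov = rbf_kernel(X_cand, X_cand, lengthscale) - K_s @ K_inv @ K_s.T
    return mu, cov

def ucb_arm(mu, cov, beta=2.0):
    """UCB-style rule: maximize mu(x) + sqrt(beta) * sigma(x) over the arms."""
    sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
    return int(np.argmax(mu + np.sqrt(beta) * sigma))

def ts_arm(mu, cov, rng):
    """Thompson-sampling rule: draw one posterior sample over the arms, maximize it."""
    sample = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(mu)))
    return int(np.argmax(sample))

# Toy usage on a 1-D arm set with a hypothetical smooth reward function.
rng = np.random.default_rng(0)
X_cand = np.linspace(0.0, 1.0, 200)[:, None]
f = lambda x: np.sin(6.0 * x).ravel()
X_hist = X_cand[[10, 100, 150]]
y_hist = f(X_hist) + 0.1 * rng.standard_normal(len(X_hist))
mu, cov = gp_posterior(X_hist, y_hist, X_cand)
print("UCB arm:", ucb_arm(mu, cov), "TS arm:", ts_arm(mu, cov, rng))
```

In a full bandit loop the chosen arm is pulled, the observed reward is appended to the history, and the posterior is recomputed each round; the paper's regret analysis concerns how the confidence width (or posterior sampling) should scale with the round index under the RKHS assumption.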

Cite

Text

Chowdhury and Gopalan. "On Kernelized Multi-Armed Bandits." International Conference on Machine Learning, 2017.

Markdown

[Chowdhury and Gopalan. "On Kernelized Multi-Armed Bandits." International Conference on Machine Learning, 2017.](https://mlanthology.org/icml/2017/chowdhury2017icml-kernelized/)

BibTeX

@inproceedings{chowdhury2017icml-kernelized,
  title     = {{On Kernelized Multi-Armed Bandits}},
  author    = {Chowdhury, Sayak Ray and Gopalan, Aditya},
  booktitle = {International Conference on Machine Learning},
  year      = {2017},
  pages     = {844--853},
  volume    = {70},
  url       = {https://mlanthology.org/icml/2017/chowdhury2017icml-kernelized/}
}