Contextual Bandits with Linear Payoff Functions
Abstract
In this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove an $O\left(\sqrt{Td\ln^3(KT\ln(T)/\delta)}\right)$ regret bound that holds with probability $1-\delta$ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of $\Omega(\sqrt{Td})$ for this setting, matching the upper bound up to logarithmic factors.
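The upper confidence bound algorithm the abstract refers to maintains, for each arm, a ridge-regression estimate of the linear payoff and adds an exploration bonus proportional to the estimate's uncertainty in the current context. A minimal sketch of this idea (function names, the `alpha` exploration parameter, and the per-arm data layout are illustrative assumptions, not the paper's exact algorithm, which interleaves confidence stages):

```python
import numpy as np

def linucb_round(A_list, b_list, contexts, alpha=1.0):
    """One round of a LinUCB-style rule: pick the arm with the
    highest upper confidence bound.

    A_list[a]   : d x d design matrix for arm a (initialized to identity)
    b_list[a]   : d-vector of reward-weighted features for arm a (zeros)
    contexts[a] : d-dimensional feature vector for arm a this round
    """
    ucb = []
    for A, b, x in zip(A_list, b_list, contexts):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                       # ridge-regression estimate
        bonus = alpha * np.sqrt(x @ A_inv @ x)  # confidence width for x
        ucb.append(theta @ x + bonus)
    return int(np.argmax(ucb))

def linucb_update(A_list, b_list, contexts, arm, reward):
    """Fold the observed reward into the chosen arm's statistics."""
    x = contexts[arm]
    A_list[arm] += np.outer(x, x)
    b_list[arm] += reward * x
```

Each round costs one $d \times d$ inverse per arm, which is what makes the method computationally simple relative to other contextual bandit algorithms.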
Cite
Text
Chu et al. "Contextual Bandits with Linear Payoff Functions." Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
Markdown
[Chu et al. "Contextual Bandits with Linear Payoff Functions." Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.](https://mlanthology.org/aistats/2011/chu2011aistats-contextual/)
BibTeX
@inproceedings{chu2011aistats-contextual,
title = {{Contextual Bandits with Linear Payoff Functions}},
author = {Chu, Wei and Li, Lihong and Reyzin, Lev and Schapire, Robert},
booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
year = {2011},
pages = {208--214},
volume = {15},
url = {https://mlanthology.org/aistats/2011/chu2011aistats-contextual/}
}