Bandits with Feedback Graphs and Switching Costs
Abstract
We study the adversarial multi-armed bandit problem where the learner is supplied with partial observations modeled by a \emph{feedback graph} and where shifting to a new action incurs a fixed \emph{switching cost}. We give two new algorithms for this problem in the informed setting. Our best algorithm achieves a pseudo-regret of $\tilde O(\gamma(G)^{\frac{1}{3}}T^{\frac{2}{3}})$, where $\gamma(G)$ is the domination number of the feedback graph. This significantly improves upon the previous best result for the same problem, which was based on the independence number of $G$. We also present matching lower bounds for our result that we describe in detail. Finally, we give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.
Cite
Text
Arora et al. "Bandits with Feedback Graphs and Switching Costs." Neural Information Processing Systems, 2019.Markdown
[Arora et al. "Bandits with Feedback Graphs and Switching Costs." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/arora2019neurips-bandits/)BibTeX
@inproceedings{arora2019neurips-bandits,
title = {{Bandits with Feedback Graphs and Switching Costs}},
author = {Arora, Raman and Marinov, Teodor Vanislavov and Mohri, Mehryar},
booktitle = {Neural Information Processing Systems},
year = {2019},
pages = {10397-10407},
url = {https://mlanthology.org/neurips/2019/arora2019neurips-bandits/}
}