Delay and Cooperation in Nonstochastic Bandits

Abstract

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce EXP3-COOP, a cooperative version of the EXP3 algorithm, and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order √((d + 1 + (K/N) α_d) (T ln K)), where α_d is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d = √K the regret bound is K^{1/4} √T, strictly better than the minimax regret √(KT) for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret √(T ln K) when G is dense. When G has sparse components, we show that a variant of EXP3-COOP, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
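The delayed-feedback setting underlying the abstract can be illustrated with a minimal single-agent sketch: an EXP3-style learner whose bandit feedback for the action played at round t only becomes usable at round t + d, with the importance-weighted update using the sampling probability from the round of play. This is a simplified illustration of bandit learning with delay, not the authors' EXP3-COOP (which additionally pools loss estimates received from the N cooperating agents over the communication graph); the oracle `loss_fn` and the learning rate `eta` are hypothetical stand-ins.

```python
import math
import random


def exp3_delayed(K, T, d, loss_fn, eta):
    """Sketch of EXP3 with feedback delayed by d rounds (single agent).

    loss_fn(t, i) -> loss in [0, 1] of arm i at round t (hypothetical oracle).
    Returns the total loss incurred over T rounds.
    """
    w = [1.0] * K                 # exponential weights over the K arms
    pending = []                  # queued feedback: (round, arm, prob at play time)
    total_loss = 0.0
    for t in range(T):
        s = sum(w)
        p = [wi / s for wi in w]
        arm = random.choices(range(K), weights=p)[0]
        total_loss += loss_fn(t, arm)
        pending.append((t, arm, p[arm]))
        if t >= d:
            # Feedback from round t - d arrives only now; perform the
            # importance-weighted update with the probability stored then.
            t0, a0, p0 = pending.pop(0)
            est = loss_fn(t0, a0) / p0   # unbiased estimate of arm a0's loss
            w[a0] *= math.exp(-eta * est)
    return total_loss
```

For example, with one arm of zero loss and the rest of loss one, the learner's cumulative loss quickly falls well below the uniform-play baseline of T(K-1)/K, despite the d-round lag in every update.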

Cite

Text

Cesa-Bianchi et al. "Delay and Cooperation in Nonstochastic Bandits." Annual Conference on Computational Learning Theory, 2016.

Markdown

[Cesa-Bianchi et al. "Delay and Cooperation in Nonstochastic Bandits." Annual Conference on Computational Learning Theory, 2016.](https://mlanthology.org/colt/2016/cesabianchi2016colt-delay/)

BibTeX

@inproceedings{cesabianchi2016colt-delay,
  title     = {{Delay and Cooperation in Nonstochastic Bandits}},
  author    = {Cesa-Bianchi, Nicolò and Gentile, Claudio and Mansour, Yishay and Minora, Alberto},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2016},
  pages     = {605--622},
  url       = {https://mlanthology.org/colt/2016/cesabianchi2016colt-delay/}
}