Bias No More: High-Probability Data-Dependent Regret Bounds for Adversarial Bandits and MDPs

Abstract

We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary. While existing approaches all require carefully constructing optimistic and biased loss estimators, our approach uses standard unbiased estimators and relies on a simple increasing learning rate schedule, together with the help of logarithmically homogeneous self-concordant barriers and a strengthened Freedman's inequality.
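For intuition only, the sketch below shows the two ingredients the abstract highlights in their simplest form: a standard unbiased importance-weighted loss estimator and a learning rate that grows over rounds. It is a generic Exp3-style toy under assumed placeholder names (exp3_increasing_lr, adversary_loss, the geometric multiplier kappa), not the paper's algorithm, which instead runs follow-the-regularized-leader with a logarithmically homogeneous self-concordant barrier and a different learning rate schedule.

import numpy as np

def exp3_increasing_lr(adversary_loss, K, T, eta0=0.01, kappa=1.0005, rng=None):
    """Toy Exp3-style bandit loop with an unbiased loss estimator and an
    increasing learning-rate schedule (illustrative only, not the paper's
    barrier-based FTRL method)."""
    rng = np.random.default_rng(rng)
    cum_est = np.zeros(K)   # cumulative estimated losses per arm
    eta = eta0
    total_loss = 0.0
    for t in range(T):
        # exponential-weights distribution over the K arms
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        loss = adversary_loss(t, arm)   # observed bandit feedback in [0, 1]
        total_loss += loss
        # standard unbiased importance-weighted estimator:
        # E[loss * 1{chosen arm} / p[arm]] recovers the true loss vector
        cum_est[arm] += loss / p[arm]
        # increasing learning rate (simple geometric placeholder schedule;
        # not the schedule analyzed in the paper)
        eta *= kappa
    return total_loss

As a usage example, exp3_increasing_lr(lambda t, a: float(a == t % 5), K=5, T=1000) runs the loop against a trivial rotating loss sequence.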

Cite

Text

Lee et al. "Bias No More: High-Probability Data-Dependent Regret Bounds for Adversarial Bandits and MDPs." Neural Information Processing Systems, 2020.

Markdown

[Lee et al. "Bias No More: High-Probability Data-Dependent Regret Bounds for Adversarial Bandits and MDPs." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/lee2020neurips-bias/)

BibTeX

@inproceedings{lee2020neurips-bias,
  title     = {{Bias No More: High-Probability Data-Dependent Regret Bounds for Adversarial Bandits and MDPs}},
  author    = {Lee, Chung-Wei and Luo, Haipeng and Wei, Chen-Yu and Zhang, Mengxiao},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/lee2020neurips-bias/}
}