A One-Size-Fits-All Solution to Conservative Bandit Problems
Abstract
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB), and conservative contextual combinatorial bandits (CCCB). Different from previous works, which consider high-probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and its variability. For this extended problem, we provide a novel algorithm with O(1/T) normalized additive regrets (T-independent in the cumulative form) and validate this result through empirical evaluation.
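For concreteness, the sample-path constraint described in the abstract can be written in the form conventional in the conservative-bandit literature. This is a hedged sketch: the symbols r_s, r_{0,s}, and alpha follow the usual convention and may differ from the paper's exact notation.

```latex
% Sample-path conservative constraint (conventional form; the paper's
% exact notation may differ).
%   r_s     : reward actually received by the learner at round s
%   r_{0,s} : reward of the baseline strategy at round s
%   alpha   : fraction of baseline reward the learner may forgo, alpha in (0,1)
\begin{equation}
  \sum_{s=1}^{t} r_s \;\ge\; (1-\alpha) \sum_{s=1}^{t} r_{0,s}
  \qquad \text{for all } t \ge 1 .
\end{equation}
```

By contrast, the high-probability constraints in prior work only require an inequality of this shape to hold for expected rewards with probability at least 1 - delta; the sample-path version must hold on every realization of the received rewards, which is a stronger requirement.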
Cite
Text
Du et al. "A One-Size-Fits-All Solution to Conservative Bandit Problems." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I8.16891
Markdown
[Du et al. "A One-Size-Fits-All Solution to Conservative Bandit Problems." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/du2021aaai-one/) doi:10.1609/AAAI.V35I8.16891
BibTeX
@inproceedings{du2021aaai-one,
title = {{A One-Size-Fits-All Solution to Conservative Bandit Problems}},
author = {Du, Yihan and Wang, Siwei and Huang, Longbo},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2021},
pages = {7254--7261},
doi = {10.1609/AAAI.V35I8.16891},
url = {https://mlanthology.org/aaai/2021/du2021aaai-one/}
}