Bandits with Costly Reward Observations
Abstract
Many machine learning applications are based on clever ways of constructing a large dataset from already existing sources in order to avoid the cost of labeling training examples. However, in settings such as content moderation, with rapidly changing distributions and no automatic ground-truth feedback, this may not be possible. If an algorithm has to pay for reward information, for example by asking a person for feedback, how does this change the exploration/exploitation tradeoff? We study Bandits with Costly Reward Observations, where a cost must be paid in order to observe the reward of the bandit's action. We show the impact of the observation cost on regret by proving an $\Omega(c^{1/3}T^{2/3})$ lower bound, presenting a general non-adaptive algorithm that matches this lower bound, and presenting several competitive adaptive algorithms.
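To illustrate why the $c^{1/3}T^{2/3}$ scaling arises when feedback must be paid for, here is a minimal explore-then-commit sketch in Python. It is an assumption-laden illustration, not the paper's general non-adaptive algorithm: the function name `costly_etc_bandit`, the per-arm observation budget `m`, and the reward interface are all hypothetical choices for this example. Exploration costs roughly $cKm$, commitment regret scales like $T/\sqrt{m}$, and balancing the two gives the $c^{1/3}T^{2/3}$ rate (constants omitted).

```python
import numpy as np

def costly_etc_bandit(reward_fns, T, c):
    """Explore-then-commit sketch for bandits with costly reward observations.

    reward_fns: list of K callables, each returning a stochastic reward in [0, 1].
    T: horizon. c: cost paid each time a reward is observed.
    The per-arm observation budget m is tuned so that the exploration cost
    (~ c * K * m) and the commitment regret (~ T / sqrt(m)) balance, which
    yields O(c^{1/3} T^{2/3})-style regret up to constants and log factors.
    """
    K = len(reward_fns)
    # Hypothetical per-arm observation budget; constants/log factors omitted.
    m = max(1, int((T / (c * K)) ** (2.0 / 3.0)))

    total_reward, total_cost = 0.0, 0.0
    means = np.zeros(K)

    # Exploration phase: pull each arm m times and pay c to observe each reward.
    for a in range(K):
        obs = [reward_fns[a]() for _ in range(m)]
        means[a] = float(np.mean(obs))
        total_reward += sum(obs)
        total_cost += c * m

    # Commit phase: play the empirically best arm and never pay to observe again.
    best = int(np.argmax(means))
    for _ in range(max(0, T - K * m)):
        total_reward += reward_fns[best]()

    return total_reward - total_cost, best
```

As a usage sketch, Bernoulli arms such as `lambda: float(np.random.rand() < 0.6)` can be passed in `reward_fns`; the returned value is the net payoff (rewards collected minus observation costs) together with the arm the policy committed to.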
Cite
Text
Tucker et al. "Bandits with Costly Reward Observations." NeurIPS 2022 Workshops: MLSW, 2022.
Markdown
[Tucker et al. "Bandits with Costly Reward Observations." NeurIPS 2022 Workshops: MLSW, 2022.](https://mlanthology.org/neuripsw/2022/tucker2022neuripsw-bandits/)
BibTeX
@inproceedings{tucker2022neuripsw-bandits,
title = {{Bandits with Costly Reward Observations}},
author = {Tucker, Aaron David and Biddulph, Caleb and Wang, Claire and Joachims, Thorsten},
booktitle = {NeurIPS 2022 Workshops: MLSW},
year = {2022},
url = {https://mlanthology.org/neuripsw/2022/tucker2022neuripsw-bandits/}
}