Bandits with Costly Reward Observations
Abstract
Many machine learning applications rely on large datasets that are conveniently collected from existing sources or that are labeled automatically as a by-product of user actions. However, in settings such as content moderation, accurately and reliably labeled data comes at substantial cost. If a learning algorithm has to pay for reward information, for example by asking a human for feedback, how does this change the exploration/exploitation tradeoff? We study this question in the context of bandit learning. Specifically, we investigate Bandits with Costly Reward Observations, where a cost needs to be paid in order to observe the reward of the bandit’s action. We show that the observation cost implies an $\Omega(c^{1/3}T^{2/3})$ lower bound on the regret. Furthermore, we develop a general non-adaptive bandit algorithm which matches this lower bound, and we present several competitive adaptive learning algorithms for both k-armed and contextual bandits.
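The abstract does not spell out the matching non-adaptive algorithm, but a standard way to reach an $O(c^{1/3}T^{2/3})$-type guarantee is an explore-then-commit scheme that purchases reward observations only during a fixed exploration phase and then plays the empirically best arm for free. The sketch below is only illustrative and is not taken from the paper; the function name `explore_then_commit`, the per-arm exploration budget formula, and the Bernoulli reward simulation are assumptions made for this example.

```python
import numpy as np


def explore_then_commit(arm_means, T, c, rng=None):
    """Illustrative non-adaptive strategy for bandits with costly reward
    observations: pay the observation cost c only during a round-robin
    exploration phase, then commit to the empirically best arm.

    `arm_means` are Bernoulli means (hidden from the learner) used to
    simulate feedback. The per-arm budget m is chosen so that the total
    observation cost and the commitment regret balance at roughly
    c^{1/3} T^{2/3}, up to constants and the number of arms k.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(arm_means)
    # Exploration budget per arm: trade the cost of buying observations
    # against the risk of committing to a suboptimal arm.
    m = max(1, int((T / max(c, 1e-12)) ** (2 / 3) / k))
    m = min(m, max(1, T // k))

    total_reward, total_cost = 0.0, 0.0
    counts = np.zeros(k)
    sums = np.zeros(k)

    t = 0
    # Exploration: cycle over arms, paying c to observe each reward.
    while t < T and counts.min() < m:
        a = t % k
        r = rng.binomial(1, arm_means[a])
        sums[a] += r
        counts[a] += 1
        total_reward += r
        total_cost += c
        t += 1

    # Commitment: play the empirically best arm without buying observations.
    best = int(np.argmax(sums / np.maximum(counts, 1)))
    while t < T:
        total_reward += rng.binomial(1, arm_means[best])
        t += 1

    # Cost-inclusive regret against always playing the best arm.
    return T * max(arm_means) - total_reward + total_cost


# Example run on a 3-armed Bernoulli instance (hypothetical parameters).
print(explore_then_commit([0.4, 0.5, 0.6], T=10_000, c=0.5))
```

With roughly $(T/c)^{2/3}$ purchased observations in total, both the money spent on feedback and the regret from the commitment phase scale as $c^{1/3}T^{2/3}$, which is what makes the lower bound in the abstract tight for non-adaptive strategies.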
Cite

Text
Tucker et al. "Bandits with Costly Reward Observations." Uncertainty in Artificial Intelligence, 2023.

Markdown
[Tucker et al. "Bandits with Costly Reward Observations." Uncertainty in Artificial Intelligence, 2023.](https://mlanthology.org/uai/2023/tucker2023uai-bandits/)

BibTeX
@inproceedings{tucker2023uai-bandits,
title = {{Bandits with Costly Reward Observations}},
author = {Tucker, Aaron D. and Biddulph, Caleb and Wang, Claire and Joachims, Thorsten},
booktitle = {Uncertainty in Artificial Intelligence},
year = {2023},
pages = {2147--2156},
volume = {216},
url = {https://mlanthology.org/uai/2023/tucker2023uai-bandits/}
}