Bandits with Costly Reward Observations

Abstract

Many machine learning applications rely on large datasets that are conveniently collected from existing sources or that are labeled automatically as a by-product of user actions. However, in settings such as content moderation, accurately and reliably labeled data comes at a substantial cost. If a learning algorithm has to pay for reward information, for example by asking a human for feedback, how does this change the exploration/exploitation tradeoff? We study this question in the context of bandit learning. Specifically, we investigate Bandits with Costly Reward Observations, where a cost $c$ must be paid in order to observe the reward of the bandit’s action. We show that over a horizon of $T$ rounds, the observation cost implies an $\Omega(c^{1/3}T^{2/3})$ lower bound on the regret. Furthermore, we develop a general non-adaptive bandit algorithm which matches this lower bound, and we present several competitive adaptive learning algorithms for both k-armed and contextual bandits.
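
To make the tradeoff concrete, the following is a minimal sketch (an illustration under simple assumptions, not necessarily the paper's exact algorithm) of a non-adaptive explore-then-commit strategy for the k-armed setting: pay $c$ to observe the reward for roughly $(T/(kc))^{2/3}$ pulls of each arm, then commit to the empirically best arm for the remaining rounds. Both the total observation cost and the regret incurred after committing then scale as $(kc)^{1/3}T^{2/3}$, matching the $c^{1/3}T^{2/3}$ dependence of the lower bound. The Python sketch below assumes Bernoulli rewards, $c > 0$, and $T \geq k$; arm_means, T, and c are hypothetical inputs.

import numpy as np

def explore_then_commit(arm_means, T, c, rng=None):
    """Non-adaptive sketch: pay c per reward observation, then commit.

    Illustration only; assumes Bernoulli rewards, c > 0, and T >= len(arm_means).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    k = len(arm_means)
    # m ~ (T / (k c))^{2/3} balances the k*m*c total observation cost
    # against the ~T/sqrt(m) regret risked by committing to the wrong arm.
    m = int((T / (k * c)) ** (2 / 3))
    m = max(1, min(m, T // k))           # stay within the horizon
    estimates = np.zeros(k)
    total_reward, total_cost, t = 0.0, 0.0, 0
    for a in range(k):                   # exploration: observe every reward
        for _ in range(m):
            r = rng.binomial(1, arm_means[a])
            estimates[a] += r / m
            total_reward += r
            total_cost += c              # pay for each observed reward
            t += 1
    best = int(np.argmax(estimates))
    # Commitment: play the empirically best arm, no longer paying to observe.
    total_reward += rng.binomial(T - t, arm_means[best])
    return total_reward - total_cost

For instance, with k = 5, T = 100000, and c = 1, this explores each arm roughly 736 times, so the total observation cost stays on the order of $(kc)^{1/3}T^{2/3} \approx 3700$.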

Cite

Text

Tucker et al. "Bandits with Costly Reward Observations." Uncertainty in Artificial Intelligence, 2023.

Markdown

[Tucker et al. "Bandits with Costly Reward Observations." Uncertainty in Artificial Intelligence, 2023.](https://mlanthology.org/uai/2023/tucker2023uai-bandits/)

BibTeX

@inproceedings{tucker2023uai-bandits,
  title     = {{Bandits with Costly Reward Observations}},
  author    = {Tucker, Aaron D. and Biddulph, Caleb and Wang, Claire and Joachims, Thorsten},
  booktitle = {Uncertainty in Artificial Intelligence},
  year      = {2023},
  pages     = {2147--2156},
  volume    = {216},
  url       = {https://mlanthology.org/uai/2023/tucker2023uai-bandits/}
}