Bandits with Costly Reward Observations

Abstract

Many machine learning applications rely on clever ways of constructing large datasets from already existing sources in order to avoid the cost of labeling training examples. However, in settings such as content moderation, where distributions change rapidly and there is no automatic ground-truth feedback, this may not be possible. If an algorithm has to pay for reward information, for example by asking a person for feedback, how does this change the exploration/exploitation tradeoff? We study Bandits with Costly Reward Observations, where a cost must be paid in order to observe the reward of the bandit's action. We show the impact of the observation cost on the regret by proving an $\Omega(c^{1/3}T^{2/3})$ lower bound, present a general non-adaptive algorithm that matches this lower bound, and develop several competitive adaptive algorithms.
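To make the tradeoff concrete, below is a minimal Python/NumPy sketch of a non-adaptive explore-then-commit strategy for this setting: the learner pays the observation cost c only during a fixed exploration phase and then commits to the empirically best arm without observing further rewards. The tuning m ≈ (T/(cK))^{2/3} and the callable-arm interface are illustrative assumptions, not necessarily the paper's exact algorithm.

import numpy as np

def explore_then_commit(arms, T, c):
    """Non-adaptive sketch for bandits with costly reward observations.

    arms: list of callables; arms[i]() returns a sampled reward in [0, 1].
    T:    horizon (total number of arm pulls).
    c:    cost paid each time a reward is observed.

    Returns the total reward collected minus the observation costs paid.
    """
    K = len(arms)
    # Exploration budget per arm; the (T / (c K))^(2/3) tuning is an
    # illustrative assumption chosen so that the total observation cost and
    # the commitment error balance at roughly (c K)^(1/3) T^(2/3).
    m = int((T / (c * K)) ** (2.0 / 3.0)) if T >= K else 0
    m = max(0, min(m, T // K))

    payoff = 0.0
    means = np.zeros(K)
    # Exploration phase: pull every arm m times and pay c per observed reward.
    for i in range(K):
        rewards = [arms[i]() for _ in range(m)]
        means[i] = np.mean(rewards) if rewards else 0.0
        payoff += sum(rewards) - c * m
    # Commitment phase: play the empirically best arm, never observing again.
    best = int(np.argmax(means))
    for _ in range(T - m * K):
        payoff += arms[best]()
    return payoff

# Example (hypothetical): three Bernoulli arms, horizon 10000, cost 0.5.
rng = np.random.default_rng(0)
arms = [lambda p=p: float(rng.random() < p) for p in (0.3, 0.5, 0.7)]
print(explore_then_commit(arms, T=10_000, c=0.5))

With this tuning, the total observation cost of order c·K·m and the commitment error of order T/√m balance at roughly (cK)^{1/3}·T^{2/3}, matching the dependence on c and T in the lower bound stated in the abstract.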

Cite

Text

Tucker et al. "Bandits with Costly Reward Observations." NeurIPS 2022 Workshops: MLSW, 2022.

Markdown

[Tucker et al. "Bandits with Costly Reward Observations." NeurIPS 2022 Workshops: MLSW, 2022.](https://mlanthology.org/neuripsw/2022/tucker2022neuripsw-bandits/)

BibTeX

@inproceedings{tucker2022neuripsw-bandits,
  title     = {{Bandits with Costly Reward Observations}},
  author    = {Tucker, Aaron David and Biddulph, Caleb and Wang, Claire and Joachims, Thorsten},
  booktitle = {NeurIPS 2022 Workshops: MLSW},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/tucker2022neuripsw-bandits/}
}