Non-Monotonic Resource Utilization in the Bandits with Knapsacks Problem

Abstract

Bandits with knapsacks (BwK) is an influential model of sequential decision-making under uncertainty that incorporates resource consumption constraints. In each round, the decision-maker observes an outcome consisting of a reward and a vector of nonnegative resource consumptions, and the budget of each resource is decremented by its consumption. In this paper we introduce a natural generalization of the stochastic BwK problem that allows non-monotonic resource utilization. In each round, the decision-maker observes an outcome consisting of a reward and a vector of resource drifts that can be positive, negative or zero, and the budget of each resource is incremented by its drift. Our main result is a Markov decision process (MDP) policy that has constant regret against a linear programming (LP) relaxation when the decision-maker knows the true outcome distributions. We build upon this to develop a learning algorithm that has logarithmic regret against the same LP relaxation when the decision-maker does not know the true outcome distributions. We also present a reduction from BwK to our model that shows our regret bound matches existing results.

Cite

Text

Kumar and Kleinberg. "Non-Monotonic Resource Utilization in the Bandits with Knapsacks Problem." Neural Information Processing Systems, 2022.

Markdown

[Kumar and Kleinberg. "Non-Monotonic Resource Utilization in the Bandits with Knapsacks Problem." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/kumar2022neurips-nonmonotonic/)

BibTeX

@inproceedings{kumar2022neurips-nonmonotonic,
  title     = {{Non-Monotonic Resource Utilization in the Bandits with Knapsacks Problem}},
  author    = {Kumar, Raunak and Kleinberg, Robert D.},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/kumar2022neurips-nonmonotonic/}
}