Thompson Sampling for the MNL-Bandit

Abstract

We consider a sequential subset selection problem under parameter uncertainty, where at each time step the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms) and observes (bandit) feedback in the form of the index of one of the items in that subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative reward over a finite horizon $T$, or, alternatively, to minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective and arise in several important application domains. We present an approach to adapting Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
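To make the setup concrete: under the MNL choice model, an item $i$ in the offered assortment $S$ is chosen with probability $v_i / (1 + \sum_{j \in S} v_j)$, and no item is chosen with probability $1 / (1 + \sum_{j \in S} v_j)$. The Python sketch below is a minimal illustration of Thompson Sampling in this setting, in the spirit of the paper's epoch-based idea: an assortment is offered repeatedly until a no-purchase outcome, so the per-epoch purchase count of item $i$ is geometric with mean $v_i$, and a Beta posterior is kept over $p_i = 1/(1+v_i)$. The instance sizes, the uniform prior, and the brute-force assortment search are assumptions made here for brevity; this is a sketch, not the authors' exact algorithm.

import itertools
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance; the sizes and parameter ranges are assumptions for this demo.
N, K, T = 8, 3, 5000
v_true = rng.uniform(0.1, 0.9, size=N)  # unknown MNL attraction parameters
r = rng.uniform(0.5, 1.0, size=N)       # known per-item rewards

def expected_revenue(S, v):
    # R(S) = sum_{i in S} r_i v_i / (1 + sum_{j in S} v_j)
    S = list(S)
    return float((r[S] * v[S]).sum() / (1.0 + v[S].sum()))

def best_assortment(v):
    # Exact search over all assortments of size <= K (fine for small N;
    # at scale one would exploit the structure of MNL assortment optimization).
    candidates = (S for k in range(1, K + 1)
                  for S in itertools.combinations(range(N), k))
    return list(max(candidates, key=lambda S: expected_revenue(S, v)))

def mnl_choice(S, v):
    # One customer: item i in S w.p. v_i / (1 + sum_j v_j), else no purchase.
    w = v[S]
    probs = np.append(w, 1.0) / (1.0 + w.sum())
    idx = rng.choice(len(S) + 1, p=probs)
    return None if idx == len(S) else S[idx]

# Sufficient statistics: epochs in which item i was offered, and total picks of i.
n_epochs = np.zeros(N)
n_picks = np.zeros(N)

t, total_reward = 0, 0.0
while t < T:
    # Posterior sample: with a uniform prior on p_i = 1/(1 + v_i), the geometric
    # per-epoch pick counts yield a Beta(n_epochs_i + 1, n_picks_i + 1) posterior
    # (a modeling assumption of this sketch), and v_i = 1/p_i - 1.
    p = rng.beta(n_epochs + 1.0, n_picks + 1.0)
    S = best_assortment(1.0 / p - 1.0)
    # One epoch: offer S repeatedly until a no-purchase outcome ends the epoch.
    while t < T:
        choice = mnl_choice(S, v_true)
        t += 1
        if choice is None:
            break
        n_picks[choice] += 1
        total_reward += r[choice]
    for i in S:
        n_epochs[i] += 1

print(f"average reward per step: {total_reward / T:.3f}")
print(f"oracle expected reward:  {expected_revenue(best_assortment(v_true), v_true):.3f}")

The epoch structure is what makes posterior updates tractable here: within an epoch the pick counts are independent geometric observations of each offered item, so no per-customer likelihood over the whole assortment is needed.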

Cite

Text

Agrawal et al. "Thompson Sampling for the MNL-Bandit." Proceedings of the 2017 Conference on Learning Theory, 2017.

Markdown

[Agrawal et al. "Thompson Sampling for the MNL-Bandit." Proceedings of the 2017 Conference on Learning Theory, 2017.](https://mlanthology.org/colt/2017/agrawal2017colt-thompson/)

BibTeX

@inproceedings{agrawal2017colt-thompson,
  title     = {{Thompson Sampling for the MNL-Bandit}},
  author    = {Agrawal, Shipra and Avadhanula, Vashist and Goyal, Vineet and Zeevi, Assaf},
  booktitle = {Proceedings of the 2017 Conference on Learning Theory},
  year      = {2017},
  pages     = {76--78},
  volume    = {65},
  url       = {https://mlanthology.org/colt/2017/agrawal2017colt-thompson/}
}