Thompson Sampling for the MNL-Bandit
Abstract
We consider a sequential subset selection problem under parameter uncertainty, where at each time step the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms) and observes (bandit) feedback in the form of the index of one of the items in that subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative reward over a finite horizon $T$, or equivalently, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective and arise in several important application domains. We present an approach to adapting Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
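To make the setup concrete, below is a minimal Python sketch of the epoch-based Thompson Sampling idea the abstract alludes to: offer a fixed assortment until the no-purchase outcome occurs, note that each item's purchase count in such an epoch is geometric with mean $v_i$, and maintain a conjugate Beta posterior on $p_i = 1/(1+v_i)$. The problem instance, pseudo-count priors, and brute-force assortment search are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance (assumed, not from the paper): N items, capacity K,
# rewards r_i, and unknown MNL weights v_i with the no-purchase weight set to 1.
N, K, T = 8, 3, 5000
r = rng.uniform(0.1, 1.0, size=N)        # item rewards
v_true = rng.uniform(0.1, 1.0, size=N)   # unknown MNL parameters

def expected_revenue(S, v):
    """MNL expected revenue of assortment S: sum_i r_i v_i / (1 + sum_i v_i)."""
    idx = list(S)
    return (r[idx] * v[idx]).sum() / (1.0 + v[idx].sum())

def best_assortment(v):
    """Brute-force argmax over subsets of size <= K (fine for small N)."""
    best, best_rev = (), 0.0
    for k in range(1, K + 1):
        for S in itertools.combinations(range(N), k):
            rev = expected_revenue(S, v)
            if rev > best_rev:
                best, best_rev = S, rev
    return best

# Posterior state: one geometric observation per offered item per epoch, so a
# Beta(epochs_i, purchases_i) posterior on p_i = 1/(1 + v_i) is conjugate.
epochs = np.ones(N)     # epochs in which item i was offered (pseudo-count prior)
purchases = np.ones(N)  # total purchases of item i (pseudo-count prior)

t = 0
while t < T:
    # Thompson step: sample p_i, convert to v_i, act greedily on the sample.
    p = rng.beta(epochs, purchases)
    v_sample = 1.0 / p - 1.0
    S = best_assortment(v_sample)

    # Offer S repeatedly until the no-purchase outcome ends the epoch.
    idx = list(S)
    probs = np.append(v_true[idx], 1.0) / (1.0 + v_true[idx].sum())
    while t < T:
        t += 1
        choice = rng.choice(len(probs), p=probs)
        if choice == len(idx):  # no-purchase: epoch over
            break
        purchases[idx[choice]] += 1
    epochs[idx] += 1

# Rough plug-in estimate: average purchases per epoch approximates v_i.
v_hat = purchases / epochs
print("assortment under estimated parameters:", best_assortment(v_hat))
print("assortment under true parameters:     ", best_assortment(v_true))
```

With this epoch structure the per-item observations are independent across items in the offered assortment, which is what makes the simple per-item Beta update valid; the paper's analysis refines this basic scheme to obtain its regret guarantee.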
Cite
Text
Agrawal et al. "Thompson Sampling for the MNL-Bandit." Proceedings of the 2017 Conference on Learning Theory, 2017.

Markdown
[Agrawal et al. "Thompson Sampling for the MNL-Bandit." Proceedings of the 2017 Conference on Learning Theory, 2017.](https://mlanthology.org/colt/2017/agrawal2017colt-thompson/)

BibTeX
@inproceedings{agrawal2017colt-thompson,
title = {{Thompson Sampling for the MNL-Bandit}},
author = {Agrawal, Shipra and Avadhanula, Vashist and Goyal, Vineet and Zeevi, Assaf},
booktitle = {Proceedings of the 2017 Conference on Learning Theory},
year = {2017},
pages = {76--78},
volume = {65},
url = {https://mlanthology.org/colt/2017/agrawal2017colt-thompson/}
}