Multinomial Logit Contextual Bandits
Abstract
We consider a dynamic assortment selection problem where the goal is to offer an assortment with cardinality constraint $K$ from a set of $N$ possible items. The sequence of assortments can be chosen as a function of the contextual information of items, and possibly users, and the goal is to maximize the expected cumulative rewards, or alternatively, minimize the expected regret. The distinguishing feature in our work is that feedback, i.e. the item chosen by the user, has a multinomial logistic distribution. We propose upper confidence interval based algorithms for this multinomial logit contextual bandit. The first algorithm is a simple and computationally more efficient method which achieves $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds with $d$ dimensional feature vectors. The second algorithm inspired by the work of \cite{li2017provably} achieves an $\tilde{O}(\sqrt{dT})$ with logarithmic dependence on $N$ and increased computational complexity because of pruning processes.
Cite
Text
Oh and Iyengar. "Multinomial Logit Contextual Bandits." ICML 2019 Workshops: RL4RealLife, 2019.Markdown
[Oh and Iyengar. "Multinomial Logit Contextual Bandits." ICML 2019 Workshops: RL4RealLife, 2019.](https://mlanthology.org/icmlw/2019/oh2019icmlw-multinomial/)BibTeX
@inproceedings{oh2019icmlw-multinomial,
title = {{Multinomial Logit Contextual Bandits}},
author = {Oh, Min-hwan and Iyengar, Garud},
booktitle = {ICML 2019 Workshops: RL4RealLife},
year = {2019},
url = {https://mlanthology.org/icmlw/2019/oh2019icmlw-multinomial/}
}