Bandits with Single-Peaked Preferences and Limited Resources
Abstract
We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences---a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
Cite
Text
Ben-Porat et al. "Bandits with Single-Peaked Preferences and Limited Resources." International Conference on Learning Representations, 2026.Markdown
[Ben-Porat et al. "Bandits with Single-Peaked Preferences and Limited Resources." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/benporat2026iclr-bandits/)BibTeX
@inproceedings{benporat2026iclr-bandits,
title = {{Bandits with Single-Peaked Preferences and Limited Resources}},
author = {Ben-Porat, Omer and Keinan, Gur and Torkan, Rotem},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/benporat2026iclr-bandits/}
}