Bandits with Single-Peaked Preferences and Limited Resources

Abstract

We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences---a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.

Cite

Text

Ben-Porat et al. "Bandits with Single-Peaked Preferences and Limited Resources." International Conference on Learning Representations, 2026.

Markdown

[Ben-Porat et al. "Bandits with Single-Peaked Preferences and Limited Resources." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/benporat2026iclr-bandits/)

BibTeX

@inproceedings{benporat2026iclr-bandits,
  title     = {{Bandits with Single-Peaked Preferences and Limited Resources}},
  author    = {Ben-Porat, Omer and Keinan, Gur and Torkan, Rotem},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/benporat2026iclr-bandits/}
}