P-Mean Regret for Stochastic Bandits

Abstract

In this work, we extend the concept of the p-mean welfare objective from social choice theory to study p-mean regret in stochastic multi-armed bandit problems. The p-mean regret, defined as the difference between the optimal mean among the arms and the p-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter p. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel p-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer et al. (2002). Under mild assumptions, we prove that our algorithm achieves a p-mean regret bound of Õ(√(k / T^{1/(2|p|)})) for all p ≤ -1, where k is the number of arms and T is the time horizon.
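As a rough illustration of the two-phase scheme described above, here is a minimal sketch in Python. It is not the paper's calibrated algorithm: the exploration length `explore_len`, the Bernoulli reward model, and the UCB1 confidence width are assumptions made for this sketch. The `p_mean` helper recovers the geometric mean at p = 0, which corresponds to the Nash regret special case.

```python
import numpy as np

def p_mean(values, p):
    """Generalized p-mean of positive values; p = 0 is the geometric mean."""
    values = np.asarray(values, dtype=float)
    if p == 0:
        return float(np.exp(np.mean(np.log(values))))
    return float(np.mean(values ** p) ** (1.0 / p))

def explore_then_ucb(means, T, explore_len, rng=None):
    """Illustrative two-phase bandit: round-robin exploration, then UCB1.

    `means` are the (unknown to the learner) arm means; rewards are drawn
    Bernoulli(means[i]). `explore_len` pulls per arm is a hypothetical
    tuning choice, not the paper's calibrated phase length.
    Returns a function mapping p to the empirical p-mean regret.
    """
    rng = np.random.default_rng(rng)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    pulled = []
    for t in range(T):
        if t < explore_len * k:
            arm = t % k  # phase 1: uniform (round-robin) exploration
        else:
            # phase 2: UCB1 index (sample mean + confidence width)
            ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        pulled.append(arm)

    def regret(p):
        # p-mean regret: optimal mean minus the p-mean of the
        # expected rewards of the arms actually pulled
        return max(means) - p_mean([means[a] for a in pulled], p)

    return regret
```

For example, `regret = explore_then_ucb([0.9, 0.5, 0.3], T=3000, explore_len=20, rng=0)` gives `regret(1)` as the average cumulative regret and `regret(0)` as the Nash regret; by the power-mean inequality, the p-mean regret is non-increasing in p.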

Cite

Text

Krishna et al. "P-Mean Regret for Stochastic Bandits." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I17.33976

Markdown

[Krishna et al. "P-Mean Regret for Stochastic Bandits." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/krishna2025aaai-p/) doi:10.1609/AAAI.V39I17.33976

BibTeX

@inproceedings{krishna2025aaai-p,
  title     = {{P-Mean Regret for Stochastic Bandits}},
  author    = {Krishna, Anand and John, Philips George and Barik, Adarsh and Tan, Vincent Y. F.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {17966-17973},
  doi       = {10.1609/AAAI.V39I17.33976},
  url       = {https://mlanthology.org/aaai/2025/krishna2025aaai-p/}
}