Thompson Sampling for Bandit Convex Optimisation
Abstract
We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz condition. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.
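For context on the algorithm being analysed, the following is a minimal sketch of Thompson sampling on a discretised one-dimensional convex bandit: maintain a posterior over environments, sample an environment each round, and play the action that is optimal for the sample. Every specific here (a finite prior over quadratic losses, known Gaussian noise, the grid sizes) is an illustrative assumption, not the paper's setting, which allows general convex losses and bounds the Bayesian regret.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a finite prior over convex losses f_m(x) = (x - m)^2
# on [0, 1], indexed by an unknown minimiser m. Purely illustrative.
K = 51                                   # action grid on [0, 1]
actions = np.linspace(0.0, 1.0, K)
minimisers = np.linspace(0.0, 1.0, 11)   # candidate environments
losses = (actions[None, :] - minimisers[:, None]) ** 2  # losses[m, a]
sigma = 0.1                              # assumed known Gaussian noise level

log_post = np.zeros(len(minimisers))     # uniform prior, kept in log-space
n, total_loss = 1000, 0.0
true_env = 7                             # hidden from the learner

for t in range(n):
    # Thompson sampling: draw an environment from the current posterior,
    # then play the optimal action for the sampled environment.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    m = rng.choice(len(minimisers), p=post)
    a = int(np.argmin(losses[m]))

    # Observe a noisy loss and update the posterior by Bayes' rule.
    y = losses[true_env, a] + sigma * rng.normal()
    log_post += -0.5 * ((y - losses[:, a]) / sigma) ** 2
    total_loss += losses[true_env, a]

regret = total_loss - n * losses[true_env].min()
print(f"cumulative regret after {n} rounds: {regret:.2f}")
```

In this toy model the posterior concentrates quickly and the regret grows sublinearly, consistent with the $\tilde O(\sqrt{n})$ rate stated above for the one-dimensional case; the paper's analysis handles the continuous, nonparametric setting where no such finite discretisation is given.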
Cite
Text
Bakhtiari et al. "Thompson Sampling for Bandit Convex Optimisation." Proceedings of Thirty Eighth Conference on Learning Theory, 2025.
Markdown
[Bakhtiari et al. "Thompson Sampling for Bandit Convex Optimisation." Proceedings of Thirty Eighth Conference on Learning Theory, 2025.](https://mlanthology.org/colt/2025/bakhtiari2025colt-thompson/)
BibTeX
@inproceedings{bakhtiari2025colt-thompson,
title = {{Thompson Sampling for Bandit Convex Optimisation}},
author = {Bakhtiari, Alireza and Lattimore, Tor and Szepesvári, Csaba},
booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
year = {2025},
  pages = {231--263},
volume = {291},
url = {https://mlanthology.org/colt/2025/bakhtiari2025colt-thompson/}
}