Thompson Sampling for Bandit Convex Optimisation

Abstract

We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz condition. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.
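To make the algorithm under study concrete, below is a minimal sketch of Thompson sampling for a $1$-dimensional bandit convex optimisation problem. The discretised quadratic loss family, the Gaussian noise model, and all parameters here are illustrative assumptions rather than the paper's setting: the paper's results hold for general convex, bounded, mildly Lipschitz losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: losses of the form f_theta(x) = (x - theta)^2
# on [0, 1], with a finite prior over candidate minimisers theta and known
# Gaussian observation noise. This is a sketch, not the paper's analysis.
thetas = np.linspace(0.0, 1.0, 51)   # candidate minimisers
log_post = np.zeros(len(thetas))     # log-posterior; uniform prior
sigma = 0.1                          # assumed-known noise standard deviation
true_theta = 0.37                    # unknown to the learner
n = 2000                             # time horizon
regret = 0.0

def loss(theta, x):
    return (x - theta) ** 2

for t in range(n):
    # Thompson sampling: draw an environment from the current posterior...
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta_t = rng.choice(thetas, p=post)
    # ...and play the action that is optimal for the sampled environment.
    x_t = theta_t
    # Observe a noisy loss and update the posterior by Bayes' rule.
    y_t = loss(true_theta, x_t) + sigma * rng.normal()
    log_post += -0.5 * ((y_t - loss(thetas, x_t)) / sigma) ** 2
    # Track regret against the optimum (the true loss is 0 at true_theta).
    regret += loss(true_theta, x_t)

print(f"cumulative regret after {n} rounds: {regret:.3f}")
```

The key step is that the learner never optimises an estimate of the loss directly; it samples a full environment from the posterior and acts greedily with respect to that sample, which is what drives the exploration analysed in the paper.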

Cite

Text

Bakhtiari et al. "Thompson Sampling for Bandit Convex Optimisation." Proceedings of Thirty Eighth Conference on Learning Theory, 2025.

Markdown

[Bakhtiari et al. "Thompson Sampling for Bandit Convex Optimisation." Proceedings of Thirty Eighth Conference on Learning Theory, 2025.](https://mlanthology.org/colt/2025/bakhtiari2025colt-thompson/)

BibTeX

@inproceedings{bakhtiari2025colt-thompson,
  title     = {{Thompson Sampling for Bandit Convex Optimisation}},
  author    = {Bakhtiari, Alireza and Lattimore, Tor and Szepesvári, Csaba},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  year      = {2025},
  pages     = {231--263},
  volume    = {291},
  url       = {https://mlanthology.org/colt/2025/bakhtiari2025colt-thompson/}
}