Delayed Feedback in Generalised Linear Bandits Revisited

Abstract

The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement of immediate rewards is unmet in many real-world applications, where rewards are almost always delayed. We provide a theoretical study of delayed rewards in generalised linear bandits. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting achieves regret of $\widetilde{\mathcal{O}}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau]\,)$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension and $T$ is the time horizon. This significantly improves upon existing approaches for this setting, where the best known regret bound was $\widetilde{\mathcal{O}}(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]}\,)$. We verify our theoretical results through experiments on simulated data.
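Note the qualitative difference between the two bounds: in the earlier bound the expected delay sits under the square root multiplying $T$, so its effect grows with the horizon, whereas in the new bound the delay appears only in an additive, horizon-independent term. To make the setting concrete, below is a minimal simulation sketch of an optimistic generalised linear bandit with delayed Bernoulli feedback. It is illustrative only and not the algorithm analysed in the paper: the fixed arm set, the geometric delay distribution, the exploration width alpha, and the Newton-based logistic MLE are all assumptions made for this example. Rewards are queued when an arm is played and released to the learner only once their random delay has elapsed, so the estimate and confidence widths are built from observed data alone.

import numpy as np

rng = np.random.default_rng(0)

d, K, T = 5, 20, 1000        # dimension, number of arms, horizon (illustrative)
mean_delay = 25.0            # E[tau]; delays drawn geometric here (an assumption)
lam, alpha = 1.0, 1.0        # ridge regularisation and exploration width (assumed)

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
arms = rng.normal(size=(K, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_mle(X, y, lam, steps=10):
    """Ridge-regularised logistic MLE via a few Newton steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + lam * theta
        W = p * (1.0 - p)
        H = (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])
        theta -= np.linalg.solve(H, grad)
    return theta

pending = []           # (arrival_time, context, reward) not yet seen by the learner
X_obs, y_obs = [], []  # feedback that has already arrived
V = lam * np.eye(d)    # design matrix built from *observed* contexts only
regret = 0.0
best = float(np.max(sigmoid(arms @ theta_star)))

for t in range(T):
    # Release any feedback whose delay has elapsed.
    arrived = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, x, r in arrived:
        X_obs.append(x)
        y_obs.append(r)
        V += np.outer(x, x)

    # Optimistic arm choice from the observed data alone.
    if X_obs:
        theta_hat = fit_logistic_mle(np.array(X_obs), np.array(y_obs), lam)
        Vinv = np.linalg.inv(V)
        width = np.sqrt(np.einsum('ki,ij,kj->k', arms, Vinv, arms))
        a = int(np.argmax(arms @ theta_hat + alpha * width))
    else:
        a = int(rng.integers(K))  # no feedback yet: play uniformly at random

    x = arms[a]
    r = float(rng.random() < sigmoid(x @ theta_star))  # Bernoulli reward
    delay = int(rng.geometric(1.0 / mean_delay))       # reward arrives later
    pending.append((t + delay, x, r))
    regret += best - sigmoid(x @ theta_star)

print(f"cumulative regret after T={T} rounds: {regret:.1f}")

Consistent with the additive $d^{3/2}\mathbb{E}[\tau]$ term in the bound, one would expect that increasing mean_delay in this sketch inflates regret early on, while feedback is still in flight, without changing the long-run $\sqrt{T}$ growth.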

Cite

Text

Howson et al. "Delayed Feedback in Generalised Linear Bandits Revisited." Artificial Intelligence and Statistics, 2023.

Markdown

[Howson et al. "Delayed Feedback in Generalised Linear Bandits Revisited." Artificial Intelligence and Statistics, 2023.](https://mlanthology.org/aistats/2023/howson2023aistats-delayed/)

BibTeX

@inproceedings{howson2023aistats-delayed,
  title     = {{Delayed Feedback in Generalised Linear Bandits Revisited}},
  author    = {Howson, Benjamin and Pike-Burke, Ciara and Filippi, Sarah},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2023},
  pages     = {6095--6119},
  volume    = {206},
  url       = {https://mlanthology.org/aistats/2023/howson2023aistats-delayed/}
}