A Regret Bound for Greedy Partially Observed Stochastic Contextual Bandits

Abstract

Contextual bandits are widely used models in reinforcement learning for incorporating both generic and idiosyncratic factors in reward functions. Existing approaches rely on full observation of the context vectors, while the problem of learning optimal arms from partially observed contexts remains largely unexplored. We show that in the latter setting, decisions can be made more cautiously to minimize the risk of pulling sub-optimal arms. More precisely, we establish efficiency of Greedy policies that treat the estimates of the unknown parameter and of the unobserved contexts as their true values. This includes non-asymptotic high-probability regret bounds that grow logarithmically with the time horizon and linearly with the number of arms. Numerical results showcasing the efficacy of avoiding exploration are provided as well.
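
As a rough illustration of the plug-in idea described in the abstract, the sketch below simulates a linear contextual bandit in which the context is seen only through a noisy linear observation. The greedy policy replaces the latent context with its posterior mean and each arm's parameter with a ridge estimate, then pulls the arm with the highest estimated reward. The Gaussian observation model, the dimensions, the noise scales, and the benchmark policy used for regret are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumptions, not taken from the paper)
d, m, K, T = 3, 2, 4, 2000           # context dim, observation dim, arms, horizon
Sigma_x = np.eye(d)                  # assumed known Gaussian context covariance
A = rng.normal(size=(m, d))          # known sensing matrix: y_t = A x_t + noise
sigma_y, sigma_r = 0.5, 0.5          # observation and reward noise scales (assumed)
mu_true = rng.normal(size=(K, d))    # unknown arm parameters

# Posterior mean of x_t given y_t under the assumed Gaussian model:
# E[x | y] = Sigma_x A^T (A Sigma_x A^T + sigma_y^2 I)^{-1} y
S = Sigma_x @ A.T @ np.linalg.inv(A @ Sigma_x @ A.T + sigma_y**2 * np.eye(m))

# Per-arm ridge-regression statistics for the plug-in parameter estimates
B = [np.eye(d) for _ in range(K)]    # Gram matrices
b = [np.zeros(d) for _ in range(K)]  # response vectors

regret = 0.0
for t in range(T):
    x = rng.multivariate_normal(np.zeros(d), Sigma_x)   # latent context (never observed)
    y = A @ x + sigma_y * rng.normal(size=m)             # partial observation
    x_hat = S @ y                                        # estimated context

    # Greedy step: treat x_hat and the current parameter estimates as exact.
    mu_hat = np.array([np.linalg.solve(B[a], b[a]) for a in range(K)])
    a_t = int(np.argmax(mu_hat @ x_hat))

    r = mu_true[a_t] @ x + sigma_r * rng.normal()        # observed reward

    # Benchmark (an assumption here): knows mu_true but sees the same observation y.
    a_star = int(np.argmax(mu_true @ x_hat))
    regret += mu_true[a_star] @ x - mu_true[a_t] @ x

    # Update the chosen arm's statistics with the *estimated* context.
    B[a_t] += np.outer(x_hat, x_hat)
    b[a_t] += r * x_hat

print(f"cumulative regret over {T} rounds: {regret:.1f}")
```

Under the logarithmic-regret result stated in the abstract, the cumulative regret in such a simulation should flatten as the horizon grows, without any explicit exploration step in the policy.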

Cite

Text

Park and Faradonbeh. "A Regret Bound for Greedy Partially Observed Stochastic Contextual Bandits." ICML 2022 Workshops: DARL, 2022.

Markdown

[Park and Faradonbeh. "A Regret Bound for Greedy Partially Observed Stochastic Contextual Bandits." ICML 2022 Workshops: DARL, 2022.](https://mlanthology.org/icmlw/2022/park2022icmlw-regret/)

BibTeX

@inproceedings{park2022icmlw-regret,
  title     = {{A Regret Bound for Greedy Partially Observed Stochastic Contextual Bandits}},
  author    = {Park, Hongju and Faradonbeh, Mohamad Kazem Shirani},
  booktitle = {ICML 2022 Workshops: DARL},
  year      = {2022},
  url       = {https://mlanthology.org/icmlw/2022/park2022icmlw-regret/}
}