Combinatorial Pure Exploration with Full-Bandit or Partial Linear Feedback

Du, Yihan; Kuroki, Yuko; Chen, Wei

doi:10.1609/AAAI.V35I8.16892

Combinatorial Pure Exploration with Full-Bandit or Partial Linear Feedback

Yihan Du, Yuko Kuroki, Wei Chen

AAAI 2021 pp. 7262-7270

doi:10.1609/AAAI.V35I8.16892 /aaai/2021/du2021aaai-combinatorial/

Abstract

In this paper, we first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL), where a learner is given a combinatorial action space X \subseteq 0,1^d, and in each round the learner pulls an action x \in X and receives a random reward with expectation x^T \theta, with \theta \in \R^d a latent and unknown environment vector. The objective is to identify the optimal action with the highest expected reward, using as few samples as possible. For CPE-BL, we design the first polynomial-time adaptive algorithm, whose sample complexity matches the lower bound (within a logarithmic factor) for a family of instances and has a light dependence of \Delta_min (the smallest gap between the optimal action and sub-optimal actions). Furthermore, we propose a novel generalization of CPE-BL with flexible feedback structures, called combinatorial pure exploration with partial linear feedback (CPE-PL), which encompasses several families of sub-problems including full-bandit feedback, semi-bandit feedback, partial feedback and nonlinear reward functions. In CPE-PL, each pull of action x reports a random feedback vector with expectation of M_x \theta , where M_x \in R^{m_x \times d} is a transformation matrix for x, and gains a random (possibly nonlinear) reward related to x. For CPE-PL, we develop the first polynomial-time algorithm, which simultaneously addresses limited feedback, general reward function and combinatorial action space (e.g., matroids, matchings and s-t paths), and provide its sample complexity analysis. Our empirical evaluation demonstrates that our algorithms run orders of magnitude faster than the existing ones, and our CPE-BL algorithm is robust across different \Delta_min settings while our CPE-PL algorithm is the first one returning correct answers for nonlinear reward functions.

PDF AAAI Semantic Scholar

Cite

Text

Du et al. "Combinatorial Pure Exploration with Full-Bandit or Partial Linear Feedback." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I8.16892

Markdown

[Du et al. "Combinatorial Pure Exploration with Full-Bandit or Partial Linear Feedback." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/du2021aaai-combinatorial/) doi:10.1609/AAAI.V35I8.16892

BibTeX

@inproceedings{du2021aaai-combinatorial,
  title     = {{Combinatorial Pure Exploration with Full-Bandit or Partial Linear Feedback}},
  author    = {Du, Yihan and Kuroki, Yuko and Chen, Wei},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2021},
  pages     = {7262-7270},
  doi       = {10.1609/AAAI.V35I8.16892},
  url       = {https://mlanthology.org/aaai/2021/du2021aaai-combinatorial/}
}