Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Covariates

Abstract

In this paper, we propose a Minimax Concave Penalized Multi-Armed Bandit (MCP-Bandit) algorithm for a decision-maker facing high-dimensional data with latent sparse structure in an online learning and decision-making process. We demonstrate that the MCP-Bandit algorithm asymptotically achieves the optimal cumulative regret in the sample size T, O(log T), and further attains a tighter bound in both the covariate dimension d and the number of significant covariates s, O(s^2 (s + log d)). In addition, we develop a linear approximation method, the 2-step Weighted Lasso procedure, to identify the MCP estimator for the MCP-Bandit algorithm under non-i.i.d. samples. Using this procedure, the MCP estimator matches the oracle estimator with high probability. Finally, we present two experiments to benchmark our proposed MCP-Bandit algorithm against other bandit algorithms. Both experiments demonstrate that the MCP-Bandit algorithm performs favorably over the benchmark algorithms, especially when there is a high level of data sparsity or when the sample size is not too small.
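The MCP itself is a standard folded concave penalty (Zhang, 2010): quadratic near zero and flat beyond a threshold, so large coefficients escape the constant shrinkage that the Lasso applies. As a reading aid, here is a minimal sketch of the scalar penalty and of its derivative, whose values are the natural per-coefficient weights in a weighted-Lasso approximation step; the function and parameter names (`lam`, `a`) are illustrative, not the paper's notation:

```python
def mcp_penalty(t, lam, a):
    """Minimax Concave Penalty (MCP) for a single coefficient t.

    Equal to lam*|t| - t^2/(2a) for |t| <= a*lam, and constant at
    a*lam^2/2 beyond that, so large coefficients incur no extra shrinkage.
    """
    abs_t = abs(t)
    if abs_t <= a * lam:
        return lam * abs_t - t * t / (2.0 * a)
    return a * lam * lam / 2.0


def mcp_weight(t, lam, a):
    """Derivative of the MCP with respect to |t|.

    Decreases linearly from lam at t = 0 to zero at |t| = a*lam; a
    weighted-Lasso approximation reuses these values as the l1 weight
    attached to each coefficient in the next optimization pass.
    """
    return max(0.0, lam - abs(t) / a)
```

A 2-step procedure in this spirit would first solve a plain Lasso (all weights equal to `lam`), then re-solve an l1 problem with `mcp_weight` evaluated at the first-step estimate, so that coefficients already estimated as large are penalized less.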

Cite

Text

Wang et al. "Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Covariates." International Conference on Machine Learning, 2018.

Markdown

[Wang et al. "Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Covariates." International Conference on Machine Learning, 2018.](https://mlanthology.org/icml/2018/wang2018icml-minimax/)

BibTeX

@inproceedings{wang2018icml-minimax,
  title     = {{Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Covariates}},
  author    = {Wang, Xue and Wei, Mingcheng and Yao, Tao},
  booktitle = {International Conference on Machine Learning},
  year      = {2018},
  pages     = {5200--5208},
  volume    = {80},
  url       = {https://mlanthology.org/icml/2018/wang2018icml-minimax/}
}