Batch Learning via Log-Sum-Exponential Estimator from Logged Bandit Feedback
Abstract
Offline policy learning methods in batch learning aim to derive a policy from a logged bandit feedback dataset, which records the context, action, propensity score, and feedback for each sample point. To achieve this objective, inverse propensity score (IPS) estimators are employed to minimize the risk. However, this approach is susceptible to high variance and performs poorly when the propensity scores are of low quality. To address these limitations, we propose a novel estimator inspired by the log-sum-exponential operator, which mitigates variance. Furthermore, we offer a theoretical analysis, including upper bounds on the bias and variance of our estimator, as well as an upper bound on the generalization error of the log-sum-exponential estimator (the difference between its empirical risk and the true risk) with a convergence rate of $O(1/\sqrt{n})$, where $n$ is the number of training samples. Additionally, we examine the performance of our estimator under limited access to clean propensity scores and on imbalanced logged bandit feedback datasets, where the number of samples per action differs.
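For intuition, the log-sum-exponential operator smoothly interpolates between the empirical mean and the extremes of its inputs. The following is a minimal sketch of how it might replace the empirical mean in the IPS risk; the specific notation ($c_i$ for the observed cost, $\pi_0$ for the logging policy, and the tuning parameter $\lambda$) is an illustrative assumption, not the paper's exact formulation:

$$\hat{R}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} c_i\,\frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}, \qquad \hat{R}_{\mathrm{LSE}}(\pi) = \frac{1}{\lambda}\log\!\left(\frac{1}{n}\sum_{i=1}^{n} \exp\!\left(\lambda\, c_i\,\frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\right)\right),$$

where $x_i$ is the context and $a_i$ the logged action. As $\lambda \to 0$, the log-sum-exponential estimate recovers the ordinary empirical mean, while a suitably signed $\lambda$ damps extreme importance-weighted terms, which is one way such an operator can reduce variance relative to plain IPS.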
Cite
Text
Behnamnia et al. "Batch Learning via Log-Sum-Exponential Estimator from Logged Bandit Feedback." ICML 2024 Workshops: ARLET, 2024.Markdown
[Behnamnia et al. "Batch Learning via Log-Sum-Exponential Estimator from Logged Bandit Feedback." ICML 2024 Workshops: ARLET, 2024.](https://mlanthology.org/icmlw/2024/behnamnia2024icmlw-batch/)BibTeX
@inproceedings{behnamnia2024icmlw-batch,
title = {{Batch Learning via Log-Sum-Exponential Estimator from Logged Bandit Feedback}},
author = {Behnamnia, Armin and Aminian, Gholamali and Aghaei, Alireza and Shi, Chengchun and Tan, Vincent Y. F. and Rabiee, Hamid R.},
booktitle = {ICML 2024 Workshops: ARLET},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/behnamnia2024icmlw-batch/}
}