Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption

Yiding Chen, Xuezhou Zhang, Qiaomin Xie, Xiaojin Zhu

AAAI 2024 pp. 11416-11424

doi:10.1609/AAAI.V38I10.29022

Abstract

We study offline reinforcement learning (RL) with heavy-tailed reward distributions and data corruption: (i) moving beyond sub-Gaussian reward distributions, we allow the rewards to have infinite variance; (ii) we allow corruption in which an attacker can arbitrarily modify a small fraction of the rewards and transitions in the dataset. We first derive a sufficient optimality condition for generalized Pessimistic Value Iteration (PEVI), which admits various estimators with proper confidence bounds and can be applied to multiple learning settings. To handle data corruption and heavy-tailed rewards, we prove that trimmed-mean estimation achieves the minimax optimal error rate for robust mean estimation under heavy-tailed distributions. We then plug the trimmed-mean estimator and its confidence bound into the PEVI algorithm to solve the robust offline RL problem. Standard analysis reveals that data corruption induces a bias term in the suboptimality gap, which gives the false impression that any data corruption prevents optimal policy learning. Using the optimality condition for generalized PEVI, we show that as long as the bias term is smaller than the "action gap", the policy returned by PEVI achieves the optimal value given sufficient data.
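
The trimmed-mean idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's estimator or its confidence bound: the function name `trimmed_mean`, the symmetric trimming rule, the `trim_fraction` parameter, and the simulated data (Student-t rewards with 1.5 degrees of freedom, which have a finite mean but infinite variance, plus a 2% corrupted fraction) are all assumptions chosen for illustration.

```python
import numpy as np


def trimmed_mean(samples, trim_fraction):
    """Average the samples after dropping the smallest and largest
    trim_fraction of them (hypothetical symmetric trimming rule)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    if n == 0:
        raise ValueError("samples must be non-empty")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    k = int(np.floor(trim_fraction * n))  # number dropped from each end
    return float(x[k:n - k].mean())


# Illustrative data: heavy-tailed rewards centered at 1.0.
rng = np.random.default_rng(0)
true_mean = 1.0
rewards = true_mean + rng.standard_t(df=1.5, size=2000)

# Assumed corruption model: an attacker overwrites 2% of the rewards.
corrupted = rewards.copy()
corrupted[:40] = 1e6

naive_estimate = float(corrupted.mean())
robust_estimate = trimmed_mean(corrupted, trim_fraction=0.05)

print("naive mean:  ", naive_estimate)
print("trimmed mean:", robust_estimate)
```

In this sketch the trimming fraction is set above the corrupted fraction so that all corrupted values are discarded. The plain sample mean is dominated by the corrupted entries, while the trimmed mean stays close to the true value up to a small residual bias, which is the kind of bias term the abstract compares against the action gap.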


Cite

Text

Chen et al. "Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I10.29022

Markdown

[Chen et al. "Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/chen2024aaai-exact/) doi:10.1609/AAAI.V38I10.29022

BibTeX

@inproceedings{chen2024aaai-exact,
  title     = {{Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption}},
  author    = {Chen, Yiding and Zhang, Xuezhou and Xie, Qiaomin and Zhu, Xiaojin},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {11416--11424},
  doi       = {10.1609/AAAI.V38I10.29022},
  url       = {https://mlanthology.org/aaai/2024/chen2024aaai-exact/}
}