Rates for Offline Reinforcement Learning with Adaptively Collected Data
Abstract
Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Such results tend to hinge on unrealistic assumptions about the data distribution — namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We propose a relaxation of the i.i.d. setting that allows logging policies to depend adaptively upon previous data. For tabular MDPs, we show that minimax-optimal bounds on the sample complexity of offline policy evaluation (OPE) and offline policy learning (OPL) can be recovered under this adaptive setting, and also derive instance-dependent bounds. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive data. We find that, even while controlling for logging policies, adaptive data can change the signed behavior of estimation error.
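The setting can be pictured with a small simulation, sketched below. This is an illustrative toy only, not the paper's estimators, bounds, or experiments: the toy MDP, the epsilon-greedy adaptation rule, and the names `collect` and `plugin_ope` are assumptions made here for clarity, and the plug-in evaluation uses discounted value iteration purely for brevity. The sketch contrasts data logged by a single fixed uniform policy with data logged by a policy that adapts to previously collected trajectories, then runs a simple model-based OPE estimate on each dataset.

import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP (sizes and dynamics are made up for illustration).
S, A, H = 4, 2, 5
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))                 # mean rewards in [0, 1]

def collect(n_episodes, adaptive):
    """Collect (s, a, r, s') tuples. If `adaptive`, the logging policy is an
    epsilon-greedy rule driven by a running value estimate built from the data
    collected so far; otherwise a single fixed uniform logging policy is used."""
    data, q_hat = [], np.zeros((S, A))
    for _ in range(n_episodes):
        s = int(rng.integers(S))
        for _ in range(H):
            if adaptive and rng.random() > 0.2:
                a = int(np.argmax(q_hat[s]))          # action depends on earlier data
            else:
                a = int(rng.integers(A))              # fixed uniform logging policy
            s_next = int(rng.choice(S, p=P[s, a]))
            r = R[s, a] + 0.1 * rng.standard_normal()
            data.append((s, a, r, s_next))
            q_hat[s, a] += 0.1 * (r - q_hat[s, a])    # crude update that makes later episodes adaptive
            s = s_next
    return data

def plugin_ope(data, pi, gamma=0.9, iters=200):
    """Plug-in (model-based) OPE: fit empirical P_hat, R_hat, then evaluate the
    deterministic target policy `pi` by value iteration on the fitted model."""
    counts, rew_sum = np.zeros((S, A, S)), np.zeros((S, A))
    for s, a, r, s2 in data:
        counts[s, a, s2] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2)
    P_hat = np.where(n_sa[..., None] > 0, counts / np.maximum(n_sa, 1)[..., None], 1.0 / S)
    R_hat = np.where(n_sa > 0, rew_sum / np.maximum(n_sa, 1), 0.0)
    V = np.zeros(S)
    for _ in range(iters):
        V = np.array([R_hat[s, pi[s]] + gamma * P_hat[s, pi[s]] @ V for s in range(S)])
    return float(V.mean())

pi_target = np.zeros(S, dtype=int)  # an arbitrary deterministic target policy
for label, adaptive in [("non-adaptive", False), ("adaptive", True)]:
    est = plugin_ope(collect(500, adaptive), pi_target)
    print(f"{label:>12} logging data: plug-in OPE estimate = {est:.3f}")

Comparing the two printed estimates against the true value of the target policy (computable from P and R) gives a rough feel for how adaptivity in the logging policy can shift the estimation error, which is the phenomenon the paper's simulations examine rigorously.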
Cite
Text
Madhow et al. "Rates for Offline Reinforcement Learning with Adaptively Collected Data." Proceedings of the 7th Annual Learning for Dynamics & Control Conference, 2025.
Markdown
[Madhow et al. "Rates for Offline Reinforcement Learning with Adaptively Collected Data." Proceedings of the 7th Annual Learning for Dynamics & Control Conference, 2025.](https://mlanthology.org/l4dc/2025/madhow2025l4dc-rates/)
BibTeX
@inproceedings{madhow2025l4dc-rates,
title = {{Rates for Offline Reinforcement Learning with Adaptively Collected Data}},
author = {Madhow, Sunil and Qiao, Dan and Yin, Ming and Wang, Yu-Xiang},
booktitle = {Proceedings of the 7th Annual Learning for Dynamics \& Control Conference},
year = {2025},
pages = {259-271},
volume = {283},
url = {https://mlanthology.org/l4dc/2025/madhow2025l4dc-rates/}
}