Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback
Abstract
In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback the agent observes only the cumulative reward along the trajectory, which makes this model particularly suitable for scenarios where querying the reward at every single step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and horizon length $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ over $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by combining the trajectory-feedback structure with techniques from linear bandits, and (ii) constructing a reference transition model to better guide the exploration process.
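The linear-bandit connection in ingredient (i) comes from the fact that a trajectory's return is a linear function of its per-state-action visitation counts, so the unknown reward vector can be estimated by ridge regression with an ellipsoidal confidence set. The sketch below is a minimal illustration of that idea under toy assumptions, not the authors' algorithm; the problem sizes, the noise model, and the confidence radius `beta` are hypothetical placeholders.

```python
import numpy as np

# Illustrative sketch: with trajectory feedback, the learner only observes the
# return sum_h r(s_h, a_h) of each episode.  Viewing the unknown per-step reward
# r in R^{S*A} as a linear-bandit parameter and the state-action visitation
# counts of a trajectory as its feature vector, the return is linear in r, so
# ridge regression gives an estimate and an ellipsoidal confidence region.

S, A, H = 5, 3, 10          # toy sizes (hypothetical)
d = S * A                   # dimension of the reward vector
lam = 1.0                   # ridge regularization
rng = np.random.default_rng(0)

true_r = rng.uniform(0.0, 1.0 / H, size=d)   # unknown reward (bounded returns)

V = lam * np.eye(d)         # design matrix  V = lam*I + sum_k phi_k phi_k^T
b = np.zeros(d)             # response vector sum_k (return_k * phi_k)

for episode in range(200):
    # A trajectory is summarized by its visitation-count vector phi(tau);
    # here the visited (s, a) indices are sampled at random for illustration.
    visits = rng.integers(0, d, size=H)
    phi = np.bincount(visits, minlength=d).astype(float)
    ret = phi @ true_r + 0.1 * rng.standard_normal()   # noisy trajectory return
    V += np.outer(phi, phi)
    b += ret * phi

r_hat = np.linalg.solve(V, b)    # ridge estimate of the reward vector

# Ellipsoidal confidence region { r : ||r - r_hat||_V <= beta }, with beta a
# placeholder for a linear-bandit-style confidence radius.
beta = 1.0
def in_confidence_region(r):
    diff = r - r_hat
    return diff @ V @ diff <= beta ** 2

print("estimation error:", np.linalg.norm(r_hat - true_r))
print("true reward inside region:", in_confidence_region(true_r))
```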
Cite
Text
Zhang et al. "Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Zhang et al. "Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhang2025icml-minimax/)
BibTeX
@inproceedings{zhang2025icml-minimax,
title = {{Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback}},
author = {Zhang, Zihan and Chen, Yuxin and Lee, Jason D. and Du, Simon Shaolei and Wang, Ruosong},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {74692--74713},
volume = {267},
url = {https://mlanthology.org/icml/2025/zhang2025icml-minimax/}
}