On the Sample Complexity Bounds of Bilevel Reinforcement Learning
Abstract
Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain relatively underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\tilde{\mathcal{O}}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL because of its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-Łojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling a tight sample complexity analysis. Our analysis also extends to general bilevel optimization settings with non-convex lower levels, where we achieve a state-of-the-art sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$, improving upon the existing bound of $\tilde{\mathcal{O}}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.
Cite
Text
Gaur et al. "On the Sample Complexity Bounds of Bilevel Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.
Markdown
[Gaur et al. "On the Sample Complexity Bounds of Bilevel Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/gaur2025neurips-sample/)
BibTeX
@inproceedings{gaur2025neurips-sample,
  title = {{On the Sample Complexity Bounds of Bilevel Reinforcement Learning}},
  author = {Gaur, Mudit and Singh, Utsav and Bedi, Amrit Singh and Pasupathy, Raghu and Aggarwal, Vaneet},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/gaur2025neurips-sample/}
}