On the Sample Complexity Bounds of Bilevel Reinforcement Learning

Abstract

Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain relatively underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\tilde{\mathcal{O}}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-Łojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bilevel optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\tilde{\mathcal{O}}(\epsilon^{-3})$, improving upon existing bounds of $\tilde{\mathcal{O}}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.
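For context on the hypergradient bottleneck the abstract mentions, the standard bilevel setup and its implicit-differentiation gradient can be sketched as follows. This is an illustrative formulation with assumed notation ($f$, $g$, $x$, $y$), not the paper's exact statement:

```latex
% Generic bilevel problem: the upper level optimizes x through the
% lower-level solution y^*(x). Notation is assumed, not from the paper.
\begin{aligned}
\min_{x} \quad & F(x) := f\bigl(x, y^{*}(x)\bigr) \\
\text{s.t.} \quad & y^{*}(x) \in \arg\min_{y} \; g(x, y).
\end{aligned}

% When g(x,\cdot) is strongly convex, implicit differentiation of the
% stationarity condition \nabla_y g(x, y^*(x)) = 0 yields the hypergradient
\nabla F(x) = \nabla_x f(x, y^{*})
  - \nabla^2_{xy} g(x, y^{*})
    \bigl[\nabla^2_{yy} g(x, y^{*})\bigr]^{-1}
    \nabla_y f(x, y^{*}).
% The Hessian-inverse term is the computational bottleneck that
% Hessian-free (fully first-order) methods avoid; under a PL condition
% (the non-convex regime studied here), y^*(x) need not be unique and
% this classical formula does not directly apply.
```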

Cite

Text

Gaur et al. "On the Sample Complexity Bounds of Bilevel Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Gaur et al. "On the Sample Complexity Bounds of Bilevel Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/gaur2025neurips-sample/)

BibTeX

@inproceedings{gaur2025neurips-sample,
  title     = {{On the Sample Complexity Bounds of Bilevel Reinforcement Learning}},
  author    = {Gaur, Mudit and Singh, Utsav and Bedi, Amrit Singh and Pasupathy, Raghu and Aggarwal, Vaneet},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/gaur2025neurips-sample/}
}