Recurrent Natural Policy Gradient for POMDPs
Abstract
Solving partially observable Markov decision processes (POMDPs) is a long-standing challenge in reinforcement learning (RL) due to the inherent curse of dimensionality arising from the non-stationarity of optimal policies. In this paper, we address this challenge by integrating recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a multi-step temporal difference (TD) method within a natural actor-critic (NAC) framework for computational efficiency. We establish non-asymptotic theoretical guarantees for this method, demonstrating its effectiveness for solving POMDPs and identifying the pathological cases that stem from long-term dependencies. By integrating RNNs into the NAC framework with theoretical guarantees, this work advances the theoretical foundations of RL for POMDPs and provides a scalable framework for solving complex decision-making problems.
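To make the recurrent actor-critic recipe described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a GRU actor and a GRU critic over observation histories, a multi-step TD target for the critic, and an NPG actor update whose Fisher preconditioner is approximated by conjugate gradient on Fisher-vector products. All class and function names, the discrete-action and GRU choices, and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence


class RecurrentPolicy(nn.Module):
    """GRU actor: maps an observation history to per-step action logits."""
    def __init__(self, obs_dim, hidden_dim, num_actions):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq):                  # obs_seq: (B, T, obs_dim)
        h, _ = self.rnn(obs_seq)
        return self.head(h)                      # (B, T, num_actions)


class RecurrentCritic(nn.Module):
    """GRU critic: maps an observation history to per-step value estimates."""
    def __init__(self, obs_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq):
        h, _ = self.rnn(obs_seq)
        return self.head(h).squeeze(-1)          # (B, T)


def n_step_td_targets(rewards, values, gamma=0.99, n=5):
    """Multi-step TD targets: n discounted rewards plus a bootstrapped value."""
    B, T = rewards.shape
    targets = torch.zeros_like(rewards)
    for t in range(T):
        acc = torch.zeros(B)
        for k in range(min(n, T - t)):
            acc = acc + (gamma ** k) * rewards[:, t + k]
        if t + n < T:
            acc = acc + (gamma ** n) * values[:, t + n]
        targets[:, t] = acc
    return targets


def natural_policy_gradient_step(policy, obs_seq, actions, advantages,
                                 lr=0.05, cg_iters=10, damping=1e-2):
    """One NPG step: solve (F + damping*I) x = g by conjugate gradient, then step along x."""
    params = list(policy.parameters())
    logits = policy(obs_seq)
    log_probs = Categorical(logits=logits).log_prob(actions)     # (B, T)

    # Vanilla policy-gradient direction g, as the gradient of a surrogate loss.
    loss = -(log_probs * advantages.detach()).mean()
    g = torch.autograd.grad(loss, params, retain_graph=True)
    g = torch.cat([v.reshape(-1) for v in g]).detach()

    # Fisher-vector products via the Hessian of the self-KL (standard NPG/TRPO trick).
    kl = kl_divergence(Categorical(logits=logits.detach()),
                       Categorical(logits=logits)).mean()
    kl_grad = torch.cat([v.reshape(-1) for v in
                         torch.autograd.grad(kl, params, create_graph=True)])

    def fisher_vec(v):
        hv = torch.autograd.grad((kl_grad * v).sum(), params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv]).detach() + damping * v

    # Conjugate gradient for x ~= F^{-1} g.
    x = torch.zeros_like(g)
    r, p, rs = g.clone(), g.clone(), g @ g
    for _ in range(cg_iters):
        Ap = fisher_vec(p)
        alpha = rs / (p @ Ap + 1e-10)
        x, r = x + alpha * p, r - alpha * Ap
        rs_new = r @ r
        p, rs = r + (rs_new / rs) * p, rs_new

    # Descend the surrogate loss along the preconditioned direction.
    with torch.no_grad():
        offset = 0
        for prm in params:
            n_el = prm.numel()
            prm -= lr * x[offset:offset + n_el].view_as(prm)
            offset += n_el
```

In a training loop, one would regress the critic's outputs toward `n_step_td_targets`, form advantages as targets minus (detached) values, and then call `natural_policy_gradient_step`; the conjugate-gradient solve avoids forming the Fisher matrix explicitly, which is what keeps the natural-gradient update tractable for RNN parameterizations.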
Cite
Text
Cayci and Eryilmaz. "Recurrent Natural Policy Gradient for POMDPs." Transactions on Machine Learning Research, 2025.
Markdown
[Cayci and Eryilmaz. "Recurrent Natural Policy Gradient for POMDPs." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/cayci2025tmlr-recurrent/)
BibTeX
@article{cayci2025tmlr-recurrent,
title = {{Recurrent Natural Policy Gradient for POMDPs}},
author = {Cayci, Semih and Eryilmaz, Atilla},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/cayci2025tmlr-recurrent/}
}