Recurrent Natural Policy Gradient for POMDPs

Abstract

Solving partially observable Markov decision processes (POMDPs) is a long-standing challenge in reinforcement learning (RL) due to the inherent curse of dimensionality arising from the non-stationarity of optimal policies. In this paper, we address this challenge by integrating recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method for the actor and a multi-step temporal difference (TD) learning method for the critic, combined within a natural actor-critic (NAC) framework for computational efficiency. We establish non-asymptotic theoretical guarantees for this method, which demonstrate its effectiveness for solving POMDPs and identify the pathological cases that stem from long-term dependencies. By integrating RNNs into the NAC framework with theoretical guarantees, this work advances the theoretical foundation of RL for POMDPs and provides a scalable framework for solving complex decision-making problems.
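
To make the algorithmic structure described in the abstract concrete, the following is a minimal, illustrative sketch of a recurrent natural actor-critic loop: a GRU-based actor updated with a damped natural policy gradient step, and a GRU-based critic trained with multi-step TD targets. This is not the authors' implementation or theoretical setting; the toy two-state POMDP, network sizes, step sizes, damping term, and n-step horizon are all assumptions chosen purely for illustration.

```python
# Minimal, illustrative sketch (NOT the authors' implementation) of a recurrent
# natural actor-critic loop on a hypothetical two-state POMDP with noisy
# observations; networks, hyperparameters, and environment are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 16

class RecurrentPolicy(nn.Module):
    """RNN actor: maps the observation history to action logits."""
    def __init__(self, obs_dim=1, n_actions=2):
        super().__init__()
        self.rnn, self.head = nn.GRUCell(obs_dim, HIDDEN), nn.Linear(HIDDEN, n_actions)
    def forward(self, obs, h):
        h = self.rnn(obs, h)
        return self.head(h), h

class RecurrentCritic(nn.Module):
    """RNN critic: estimates the value of the observation history."""
    def __init__(self, obs_dim=1):
        super().__init__()
        self.rnn, self.head = nn.GRUCell(obs_dim, HIDDEN), nn.Linear(HIDDEN, 1)
    def forward(self, obs, h):
        h = self.rnn(obs, h)
        return self.head(h).squeeze(-1), h

def rollout(policy, critic, T=32, p_flip=0.1):
    """Toy POMDP: hidden state in {0, 1} observed with flip probability p_flip;
    reward +1 for guessing the hidden state, -1 otherwise."""
    s = torch.randint(0, 2, (1,)).item()
    hp, hc = torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)
    logps, values, rewards = [], [], []
    for _ in range(T):
        o = torch.tensor([[float(s)]])
        if torch.rand(1).item() < p_flip:   # noisy observation of the hidden state
            o = 1.0 - o
        logits, hp = policy(o, hp)
        v, hc = critic(o, hc)
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        logps.append(dist.log_prob(a))
        values.append(v)
        rewards.append(1.0 if a.item() == s else -1.0)
        if torch.rand(1).item() < 0.05:     # hidden state occasionally switches
            s = 1 - s
    return torch.cat(logps), torch.cat(values), rewards

def n_step_targets(rewards, values, gamma=0.95, n=4):
    """Multi-step TD targets bootstrapped from the (detached) critic."""
    T, targets = len(rewards), []
    for t in range(T):
        G = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
        if t + n < T:
            G += gamma ** n * float(values[t + n])
        targets.append(G)
    return torch.tensor(targets)

def flat_grad(scalar, params):
    """Gradient of a scalar w.r.t. params, flattened into one vector."""
    grads = torch.autograd.grad(scalar, params, retain_graph=True, allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, params)])

def natural_actor_step(policy, logps, advantages, lr=0.05, damping=1e-2):
    """Damped NPG step: solve (F + damping * I) d = g, with F the empirical
    Fisher matrix built from per-step score functions (small models only)."""
    params = list(policy.parameters())
    g = flat_grad((logps * advantages).mean(), params)             # vanilla policy gradient
    scores = torch.stack([flat_grad(lp, params) for lp in logps])  # per-step score vectors
    F = scores.T @ scores / scores.shape[0]                        # empirical Fisher matrix
    d = torch.linalg.solve(F + damping * torch.eye(F.shape[0]), g)
    with torch.no_grad():
        i = 0
        for p in params:
            p.add_(lr * d[i:i + p.numel()].view_as(p))
            i += p.numel()

def critic_step(critic, values, targets, lr=0.05):
    """Multi-step TD update of the critic by semi-gradient descent."""
    loss = ((values - targets) ** 2).mean()
    critic.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in critic.parameters():
            p -= lr * p.grad

policy, critic = RecurrentPolicy(), RecurrentCritic()
for it in range(200):
    logps, values, rewards = rollout(policy, critic)
    targets = n_step_targets(rewards, values.detach())
    natural_actor_step(policy, logps, targets - values.detach())   # advantage = target - V
    critic_step(critic, values, targets)
    if it % 50 == 0:
        print(f"iter {it:3d}  mean reward {sum(rewards) / len(rewards):+.2f}")
```

Note that forming the Fisher matrix explicitly, as above, is only feasible for very small recurrent models; larger-scale implementations typically approximate the natural gradient direction with conjugate gradient or Kronecker-factored methods. The sketch is only meant to make the actor-critic structure in the abstract concrete.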

Cite

Text

Cayci and Eryilmaz. "Recurrent Natural Policy Gradient for POMDPs." Transactions on Machine Learning Research, 2025.

Markdown

[Cayci and Eryilmaz. "Recurrent Natural Policy Gradient for POMDPs." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/cayci2025tmlr-recurrent/)

BibTeX

@article{cayci2025tmlr-recurrent,
  title     = {{Recurrent Natural Policy Gradient for POMDPs}},
  author    = {Cayci, Semih and Eryilmaz, Atilla},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/cayci2025tmlr-recurrent/}
}