Offline Actor-Critic for Average Reward MDPs

Abstract

We study offline policy optimization for infinite-horizon average-reward Markov decision processes (MDPs) with large or infinite state spaces. Specifically, we propose a pessimistic actor-critic algorithm that uses a computationally efficient linear function class for value function estimation. At the core of our method is a critic that computes a pessimistic estimate of the average reward under the current policy, as well as the corresponding policy gradient, by solving a fixed-point Bellman equation rather than a sequence of regression problems as in finite-horizon settings. This procedure reduces to solving a second-order cone program, which is computationally tractable. Our theoretical analysis is based on a weak partial data coverage assumption, which requires only that the offline data aligns well with the expected feature vector of a comparator policy. Under this condition, we show that our algorithm achieves the optimal sample complexity of $O(\varepsilon^{-2})$ for learning a near-optimal policy, up to model misspecification errors.
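The abstract does not spell out implementation details, but the sketch below illustrates how a pessimistic critic of this kind could be posed as a second-order cone program, assuming a linear value class $Q_w(s,a) = \phi(s,a)^\top w$, offline transitions $(s, a, r, s')$, and a pessimism radius `beta` on the empirical projected Bellman residual. The function name `pessimistic_critic` and all parameter choices are illustrative assumptions, not the authors' code.

```python
# A minimal sketch (not the authors' implementation) of a pessimistic critic
# for average-reward MDPs with a linear value class, solved as a second-order
# cone program. `beta` (pessimism radius) and the regularizer `reg` are
# assumed hyperparameters.
import numpy as np
import cvxpy as cp

def pessimistic_critic(Phi, R, Phi_next, beta, reg=1e-3):
    """Return (w, rho): linear critic weights and a pessimistic average-reward estimate.

    Phi      : (n, d) features phi(s_i, a_i) of offline state-action pairs
    R        : (n,)   observed rewards
    Phi_next : (n, d) expected next features E_{a' ~ pi}[phi(s'_i, a')] under the current policy
    beta     : float  pessimism radius on the empirical projected Bellman residual
    """
    n, d = Phi.shape
    Lambda = Phi.T @ Phi / n + reg * np.eye(d)       # regularized empirical feature covariance
    A = np.linalg.cholesky(np.linalg.inv(Lambda))    # whitening matrix: ||A.T x||_2^2 = x^T Lambda^{-1} x

    w = cp.Variable(d)
    rho = cp.Variable()

    # Empirical projected residual of the average-reward Bellman fixed point
    #   phi(s,a)^T w + rho  ≈  r + E_{a'~pi}[phi(s',a')]^T w
    resid = Phi.T @ (Phi @ w + rho - R - Phi_next @ w) / n

    # Second-order cone constraint: the residual must lie in a Lambda^{-1}-norm ball.
    constraints = [cp.norm(A.T @ resid, 2) <= beta]

    # Pessimism: among all (w, rho) consistent with the data, report the
    # smallest plausible average reward for the current policy.
    cp.Problem(cp.Minimize(rho), constraints).solve()
    return w.value, rho.value
```

In this sketch the actor would then use the returned weights to form a (pessimistic) policy-gradient direction and update the policy; the precise gradient construction and the choice of `beta` follow the paper's analysis rather than anything shown here.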

Cite

Text

Powell et al. "Offline Actor-Critic for Average Reward MDPs." Advances in Neural Information Processing Systems, 2025.

Markdown

[Powell et al. "Offline Actor-Critic for Average Reward MDPs." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/powell2025neurips-offline/)

BibTeX

@inproceedings{powell2025neurips-offline,
  title     = {{Offline Actor-Critic for Average Reward MDPs}},
  author    = {Powell, William and Kwon, Jeongyeol and Xie, Qiaomin and Lyu, Hanbaek},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/powell2025neurips-offline/}
}