Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning

Abstract

In this work, we contribute the first approach for solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated on a single trajectory. First, we provide fundamental results on policy optimization in the single-trial regime: we investigate which class of policies suffices for optimality, cast the problem as an equivalent MDP, and study the computational hardness of policy optimization. Second, we show how online planning techniques, in particular a Monte-Carlo tree search algorithm, can be leveraged to solve GUMDPs in the single-trial regime. Third, we provide experimental results showing that our approach outperforms relevant baselines.
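To make the "single-trial" distinction concrete, here is a minimal Python sketch. It is not taken from the paper. It assumes the formulation common in the GUMDP literature, where a convex utility f is applied to the discounted state-action occupancy, and it uses negative entropy as an illustrative f. All names (`empirical_occupancy`, `neg_entropy`, `average_occupancies`) and the two toy trajectories are hypothetical. The sketch contrasts the expected cost of a single trajectory, E[f(d_hat)], with the cost of the averaged occupancy, f(E[d_hat]).

```python
import math
from collections import defaultdict


def empirical_occupancy(trajectory, gamma):
    """Discounted empirical occupancy of one non-empty, finite trajectory.

    Each (state, action) pair at step t receives weight gamma**t; the weights
    are normalised so that the occupancy sums to 1 (assumed truncation fix).
    """
    weights = defaultdict(float)
    for t, state_action in enumerate(trajectory):
        weights[state_action] += gamma ** t
    total = sum(weights.values())
    return {sa: w / total for sa, w in weights.items()}


def neg_entropy(occupancy):
    """Illustrative convex utility (a cost): f(d) = sum_x d(x) * log d(x)."""
    return sum(p * math.log(p) for p in occupancy.values() if p > 0.0)


def average_occupancies(occupancies):
    """Uniform average of several occupancies (stands in for E[d_hat])."""
    avg = defaultdict(float)
    for d in occupancies:
        for sa, p in d.items():
            avg[sa] += p / len(occupancies)
    return dict(avg)


# Toy setting (hypothetical): the agent ends up in one of two absorbing
# states with equal probability and stays there for the whole trajectory.
gamma = 0.9
d_a = empirical_occupancy([("s0", "stay")] * 20, gamma)
d_b = empirical_occupancy([("s1", "stay")] * 20, gamma)

# Single-trial objective: the utility is evaluated on each trajectory's own
# occupancy, and then averaged over trajectories.
single_trial_cost = 0.5 * neg_entropy(d_a) + 0.5 * neg_entropy(d_b)

# Infinite-trials objective: the utility is evaluated on the averaged occupancy.
infinite_trials_cost = neg_entropy(average_occupancies([d_a, d_b]))
```

In this toy case every individual trajectory covers a single state-action pair, so the single-trial cost is 0, while the averaged occupancy is uniform over two pairs and gives a cost of -log 2. The gap is an instance of Jensen's inequality for convex f, and it illustrates why a policy that looks good on the averaged occupancy may perform poorly when judged on one trajectory.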

Cite

Text

Santos et al. "Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning." International Conference on Learning Representations, 2026.

Markdown

[Santos et al. "Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/santos2026iclr-solving/)

BibTeX

@inproceedings{santos2026iclr-solving,
  title     = {{Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning}},
  author    = {Santos, Pedro Pinto and Sardinha, Alberto and Melo, Francisco S.},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/santos2026iclr-solving/}
}