Online Episodic Convex Reinforcement Learning

Abstract

We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent’s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge of the transition function. To achieve this, we use a novel online mirror descent algorithm with variable constraint sets and a carefully designed exploration bonus. We then address, for the first time, a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent’s policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
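For orientation, here is a standard formalization of the CURL setting the abstract describes; the notation is illustrative and not taken verbatim from the paper. A policy \pi in a finite-horizon MDP induces a state-action occupancy measure \mu^\pi, and CURL replaces the linear losses of standard RL with convex functions of \mu^\pi:

% Occupancy measure of policy \pi over horizon H under transition kernel p:
\mu^{\pi}_h(s, a) = \Pr\left[ s_h = s,\ a_h = a \,\middle|\, \pi, p \right], \qquad h = 1, \dots, H.

% Standard RL is linear in \mu^\pi; CURL allows convex episode losses f_t:
\text{standard RL loss: } \langle \ell_t, \mu^{\pi_t} \rangle, \qquad \text{CURL loss: } f_t\!\left(\mu^{\pi_t}\right), \ f_t \text{ convex}.

% Regret over T episodes against the best fixed policy:
R_T = \sum_{t=1}^{T} f_t\!\left(\mu^{\pi_t}\right) - \min_{\pi} \sum_{t=1}^{T} f_t\!\left(\mu^{\pi}\right).

In the bandit variant described above, the learner observes only the scalar value f_t(\mu^{\pi_t}) after each episode, rather than the function f_t or its gradient.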

Cite

Text

Moreno et al. "Online Episodic Convex Reinforcement Learning." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Moreno et al. "Online Episodic Convex Reinforcement Learning." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/moreno2025icml-online/)

BibTeX

@inproceedings{moreno2025icml-online,
  title     = {{Online Episodic Convex Reinforcement Learning}},
  author    = {Moreno, Bianca Marin and Eldowa, Khaled and Gaillard, Pierre and Brégère, Margaux and Oudjane, Nadia},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {44775--44824},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/moreno2025icml-online/}
}