Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives

Abstract

We consider an agent interacting with an online Markov decision process who receives a vector of outcomes in every round. The agent aims to simultaneously optimize multiple objectives associated with the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the vectorial outcomes to achieve near-optimality; in particular, contrary to the single-objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm based on the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), and a crucial, novel gradient threshold procedure. The procedure carefully delays gradient updates and returns a non-stationary policy that diversifies the outcomes to optimize the objectives.
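To make the abstract's description concrete, below is a minimal, hypothetical sketch of a Frank-Wolfe style loop with delayed gradient updates: a running average of the vectorial outcomes is maintained, the objective's gradient at that average defines a scalarized reward, and replanning happens only once the gradient has drifted past a threshold. The callables solve_scalar_mdp, sample_outcome, and grad_f, the threshold value, and the overall structure are illustrative assumptions, not the paper's exact algorithm, which additionally relies on UCRL2-style optimistic planning and episode scheduling.

import numpy as np

def gradient_threshold_frank_wolfe(
    solve_scalar_mdp,   # hypothetical planner: gradient vector -> policy (stand-in for a UCRL2-style oracle)
    sample_outcome,     # hypothetical environment step: policy -> outcome vector in R^d
    grad_f,             # gradient of the concave objective f, evaluated at an outcome average
    d,                  # dimension of the outcome vector
    horizon=10_000,
    threshold=0.1,
):
    """Sketch of a Frank-Wolfe loop that delays gradient updates via a threshold."""
    avg_outcome = np.zeros(d)        # running average of the vectorial feedback
    g = grad_f(avg_outcome)          # current linearization direction
    policy = solve_scalar_mdp(g)     # plan against the scalarized reward <g, outcome>
    for t in range(1, horizon + 1):
        v = sample_outcome(policy)              # observe a d-dimensional outcome
        avg_outcome += (v - avg_outcome) / t    # update the running average
        g_new = grad_f(avg_outcome)
        # Delayed gradient update: replan only when the gradient has drifted enough,
        # keeping each policy in place long enough for its state distribution to settle.
        if np.linalg.norm(g_new - g) > threshold:
            g = g_new
            policy = solve_scalar_mdp(g)
    return avg_outcome

The resulting sequence of policies is non-stationary by construction: each threshold crossing switches to a new scalarization, which is what lets the outcome average be steered toward a point optimizing the joint objectives.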

Cite

Text

Cheung. "Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives." Neural Information Processing Systems, 2019.

Markdown

[Cheung. "Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/cheung2019neurips-regret/)

BibTeX

@inproceedings{cheung2019neurips-regret,
  title     = {{Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives}},
  author    = {Cheung, Wang Chi},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {726--736},
  url       = {https://mlanthology.org/neurips/2019/cheung2019neurips-regret/}
}