Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification

Abstract

In reinforcement learning, because training a policy in the real world is costly and risky, policies are trained in a simulation environment and then transferred to the corresponding real-world environment. However, the simulation environment does not perfectly mimic the real-world environment, which leads to model misspecification. Multiple studies report significant deterioration of policy performance in the real-world environment. In this study, we focus on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values, called the uncertainty parameter set. The aim is to optimize the worst-case performance over the uncertainty parameter set to guarantee performance in the corresponding real-world environment. To obtain a policy for this optimization, we propose an off-policy actor-critic approach called the Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min optimization problem using a simultaneous gradient ascent-descent approach. Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method exhibits worst-case performance superior to that of several baseline approaches.
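
The simultaneous gradient ascent-descent idea mentioned in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the M2TD3 algorithm itself: the scalar objective, the names `theta` (standing in for policy parameters) and `omega` (standing in for an uncertainty parameter), the learning rate, and the step count are all assumptions chosen for illustration.

```python
def objective_grads(theta, omega):
    """Gradients of a hypothetical toy objective
    J(theta, omega) = -(theta - 1)^2 + (omega - 2)^2 + theta * omega,
    which is concave in theta and convex in omega."""
    grad_theta = -2.0 * (theta - 1.0) + omega
    grad_omega = 2.0 * (omega - 2.0) + theta
    return grad_theta, grad_omega


def simultaneous_gda(theta=0.0, omega=0.0, lr=0.1, steps=500):
    """Solve max over theta, min over omega of J by simultaneous updates."""
    for _ in range(steps):
        g_theta, g_omega = objective_grads(theta, omega)
        # Both updates use gradients evaluated at the same iterate:
        # ascent for the maximizing variable, descent for the minimizing one.
        theta, omega = theta + lr * g_theta, omega - lr * g_omega
    return theta, omega


theta_star, omega_star = simultaneous_gda()
print(theta_star, omega_star)
```

For this particular objective, setting both gradients to zero gives the saddle point `theta = 1.6`, `omega = 1.2`, and the iterates approach it. Convergence here relies on the toy objective being strongly concave in `theta` and strongly convex in `omega`; simultaneous updates on general max-min problems may oscillate or diverge.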

Cite

Text

Tanabe et al. "Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification." Neural Information Processing Systems, 2022.

Markdown

[Tanabe et al. "Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/tanabe2022neurips-maxmin/)

BibTeX

@inproceedings{tanabe2022neurips-maxmin,
  title     = {{Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification}},
  author    = {Tanabe, Takumi and Sato, Rei and Fukuchi, Kazuto and Sakuma, Jun and Akimoto, Youhei},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/tanabe2022neurips-maxmin/}
}