Resmax: An Alternative Soft-Greedy Operator for Reinforcement Learning

Abstract

Soft-greedy operators, namely $\varepsilon$-greedy and softmax, remain a common choice to induce a basic level of exploration for action-value methods in reinforcement learning. These operators, however, have a few critical limitations. In this work, we investigate a simple soft-greedy operator, which we call resmax, that takes actions proportionally to their max action gap: the residual to the estimated maximal value. It is simple to use and ensures coverage of the state-space like $\varepsilon$-greedy, but focuses exploration more on potentially promising actions like softmax. Further, it does not concentrate probability as quickly as softmax, and so better avoids overemphasizing sub-optimal actions that appear high-valued during learning. Additionally, we prove it is a non-expansion for any fixed exploration hyperparameter, unlike the softmax policy which requires a state-action specific temperature to obtain a non-expansion (called mellowmax). We empirically validate that resmax is comparable to or outperforms $\varepsilon$-greedy and softmax across a variety of environments in tabular and deep RL.
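
The abstract describes the operator only informally. As a rough illustration of what a gap-based soft-greedy policy can look like next to ε-greedy and softmax, the Python sketch below weights each action by the inverse of (1 + η · gap), where the gap is the residual of an action's estimated value to the estimated maximum. The functional form, the name gap_based_policy, and the parameter eta are illustrative assumptions for comparison purposes, not the paper's exact resmax definition.

import numpy as np

def epsilon_greedy(q, epsilon=0.1):
    """Standard epsilon-greedy distribution over actions for value estimates q."""
    n = len(q)
    probs = np.full(n, epsilon / n)      # uniform exploration mass
    probs[np.argmax(q)] += 1.0 - epsilon # remaining mass on the greedy action
    return probs

def softmax(q, tau=1.0):
    """Boltzmann/softmax distribution with temperature tau."""
    z = (q - np.max(q)) / tau            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gap_based_policy(q, eta=1.0):
    """Illustrative gap-based soft-greedy distribution (an assumption, not
    necessarily the paper's exact resmax operator): weight each action by
    1 / (1 + eta * gap), where gap = max(q) - q[a] is the action's residual
    to the estimated maximal value. Greedy actions (gap = 0) get the largest
    weight, and every action keeps nonzero probability, so the state space
    is still covered."""
    gap = np.max(q) - q
    w = 1.0 / (1.0 + eta * gap)
    return w / w.sum()

q = np.array([1.0, 0.5, -2.0])
print(epsilon_greedy(q))
print(softmax(q))
print(gap_based_policy(q))

In this toy example, the gap-based distribution concentrates less sharply on the highest-valued action than softmax with a comparable hyperparameter, while still assigning more probability to actions whose estimates are closer to the maximum.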

Cite

Text

Miahi et al. "Resmax: An Alternative Soft-Greedy Operator for Reinforcement Learning." Transactions on Machine Learning Research, 2023.

Markdown

[Miahi et al. "Resmax: An Alternative Soft-Greedy Operator for Reinforcement Learning." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/miahi2023tmlr-resmax/)

BibTeX

@article{miahi2023tmlr-resmax,
  title     = {{Resmax: An Alternative Soft-Greedy Operator for Reinforcement Learning}},
  author    = {Miahi, Erfan and MacQueen, Revan and Ayoub, Alex and Masoumzadeh, Abbas and White, Martha},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023},
  url       = {https://mlanthology.org/tmlr/2023/miahi2023tmlr-resmax/}
}