RMSProp Converges with Proper Hyper-Parameter
Abstract
Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges with a proper choice of hyper-parameters under certain conditions. More specifically, we prove that when the hyper-parameter $\beta_2$ is close enough to $1$, RMSprop and its random shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not depend on the "bounded gradient" assumption, which is often the key assumption utilized by existing theoretical work for Adam-type adaptive gradient methods. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. Finally, based on our theory, we conjecture that in practice there is a critical threshold $\beta_2^*$ such that RMSprop generates reasonably good results only if $1 > \beta_2 \ge \beta_2^*$. We provide empirical evidence for such a phase transition in our numerical experiments.
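For reference, a minimal sketch (not taken from the paper) of the standard RMSprop update is shown below, illustrating the role of the hyper-parameter $\beta_2$ discussed in the abstract; the learning rate `lr`, smoothing constant `eps`, and the toy quadratic objective are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """One RMSprop update: v tracks an exponential moving average of the
    squared gradient, controlled by beta2; the step rescales the gradient
    by 1 / (sqrt(v) + eps)."""
    v = beta2 * v + (1 - beta2) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v

# Illustrative usage on a toy quadratic f(w) = 0.5 * ||w||^2,
# with beta2 chosen close to 1 as suggested by the paper's theory.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(1000):
    grad = w  # gradient of the toy objective
    w, v = rmsprop_step(w, grad, v, beta2=0.999)
print(w)
```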
Cite
Text
Shi et al. "RMSProp Converges with Proper Hyper-Parameter." International Conference on Learning Representations, 2021.
Markdown
[Shi et al. "RMSProp Converges with Proper Hyper-Parameter." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/shi2021iclr-rmsprop/)
BibTeX
@inproceedings{shi2021iclr-rmsprop,
title = {{RMSProp Converges with Proper Hyper-Parameter}},
author = {Shi, Naichen and Li, Dawei and Hong, Mingyi and Sun, Ruoyu},
booktitle = {International Conference on Learning Representations},
year = {2021},
url = {https://mlanthology.org/iclr/2021/shi2021iclr-rmsprop/}
}