Learning Efficient Parameter Server Synchronization Policies for Distributed SGD

Abstract

We apply a reinforcement learning (RL)-based approach to learning optimal synchronization policies used for Parameter Server (PS)-based distributed training of machine learning models with Stochastic Gradient Descent (SGD). Utilizing a formal synchronization policy description in the PS setting, we derive a suitable and compact description of states and actions, allowing us to efficiently apply the standard off-the-shelf deep Q-learning algorithm. As a result, we are able to learn synchronization policies which generalize to different cluster environments, different training datasets, and small model variations, and (most importantly) lead to considerable decreases in training time when compared to standard policies such as bulk synchronous parallel (BSP), asynchronous parallel (ASP), or stale synchronous parallel (SSP). To support our claims we present extensive numerical results obtained from experiments performed in simulated cluster environments. In our experiments training time is reduced by 44% on average and learned policies generalize to multiple unseen circumstances.
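The abstract describes learning synchronization decisions with off-the-shelf deep Q-learning over a compact state/action space. The minimal sketch below illustrates the general shape of such a loop on a toy simulated cluster; the state features (per-worker staleness), the two-action set (proceed asynchronously vs. force a sync barrier), the reward, and the `toy_step` dynamics are all illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative deep Q-learning loop for choosing PS synchronization actions.
# The environment, state, and reward below are simplified stand-ins.
import random
import torch
import torch.nn as nn

N_WORKERS = 4
STATE_DIM = N_WORKERS   # state: per-worker staleness since the last sync
N_ACTIONS = 2           # 0 = let workers proceed, 1 = force a sync barrier

q_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, eps = 0.99, 0.1

def toy_step(staleness, action):
    """Hypothetical cluster dynamics: syncing resets staleness but pays a
    fixed barrier cost; proceeding is cheap but stale gradients hurt."""
    if action == 1:
        return [0] * N_WORKERS, -1.0
    next_state = [s + random.randint(0, 1) for s in staleness]
    return next_state, -0.1 * max(staleness)

state = [0] * N_WORKERS
for step in range(1000):
    s = torch.tensor(state, dtype=torch.float32)
    # Epsilon-greedy action selection over the Q-network's outputs.
    if random.random() < eps:
        action = random.randrange(N_ACTIONS)
    else:
        action = int(q_net(s).argmax())
    next_state, reward = toy_step(state, action)
    s2 = torch.tensor(next_state, dtype=torch.float32)
    # One-step TD target; a full agent would add replay and a target network.
    with torch.no_grad():
        target = reward + gamma * q_net(s2).max()
    loss = (q_net(s)[action] - target) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
    state = next_state
```

In this toy setup the agent learns when the staleness penalty of proceeding outweighs the fixed cost of a barrier, which is the same kind of trade-off a learned policy must navigate between the BSP and ASP extremes.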

Cite

Text

Zhu et al. "Learning Efficient Parameter Server Synchronization Policies for Distributed SGD." International Conference on Learning Representations, 2020.

Markdown

[Zhu et al. "Learning Efficient Parameter Server Synchronization Policies for Distributed SGD." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/zhu2020iclr-learning/)

BibTeX

@inproceedings{zhu2020iclr-learning,
  title     = {{Learning Efficient Parameter Server Synchronization Policies for Distributed SGD}},
  author    = {Zhu, Rong and Yang, Sheng and Pfadler, Andreas and Qian, Zhengping and Zhou, Jingren},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/zhu2020iclr-learning/}
}