Asynchronous Training Schemes in Distributed Learning with Time Delay

Abstract

In distributed deep learning, stale weights or gradients can result in poor algorithmic performance. This issue is typically tackled by delay-tolerant algorithms under mild assumptions on the objective functions and step sizes. In this paper, we take a different approach and develop a new algorithm called \textbf{P}redicting \textbf{C}lipping \textbf{A}synchronous \textbf{S}tochastic \textbf{G}radient \textbf{D}escent (PC-ASGD). Specifically, PC-ASGD consists of two steps: the \textit{predicting step} leverages gradient prediction via Taylor expansion to reduce the staleness of the outdated weights, while the \textit{clipping step} selectively drops the outdated weights to alleviate their negative effects. A tradeoff parameter is introduced to balance the effects of these two steps. Theoretically, we present the convergence rate of the proposed algorithm with a constant step size, accounting for the effects of delay, when the smooth objective functions are weakly strongly-convex, general convex, or nonconvex. We also propose a practical variant of PC-ASGD that adopts a condition to help determine the tradeoff parameter. For empirical validation, we demonstrate the performance of the algorithm with four deep neural network architectures on three benchmark datasets.
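The abstract does not spell out the update rule, but the two-step structure it describes can be illustrated with a minimal Python sketch of one worker's update. Everything here is an assumption for illustration only: the function name pc_asgd_step, the diagonal curvature surrogate g*g standing in for the Taylor-expansion correction, the interpretation of clipping as zeroing the delayed contribution, and the convex combination via the tradeoff parameter theta are not taken from the paper.

import numpy as np

def pc_asgd_step(w, g_fresh, g_stale, w_stale, lr, theta):
    """One hypothetical PC-ASGD-style update (illustrative sketch only).

    w       : current weights
    g_fresh : gradient from up-to-date workers
    g_stale : delayed gradient computed at the outdated weights w_stale
    theta   : tradeoff parameter in [0, 1] blending predict vs. clip
    """
    # Predicting step (assumed form): first-order Taylor correction of the
    # stale gradient, using the cheap elementwise surrogate g*g in place of
    # the Hessian.
    g_pred = g_stale + (g_stale * g_stale) * (w - w_stale)

    # Clipping step (assumed form): drop the delayed contribution entirely.
    g_clip = np.zeros_like(g_stale)

    # Blend the two treatments of the delayed gradient via theta, then take
    # a plain SGD step on the combined gradient.
    g_delayed = theta * g_pred + (1.0 - theta) * g_clip
    return w - lr * (g_fresh + g_delayed)

# Toy usage with random vectors:
rng = np.random.default_rng(0)
w = rng.normal(size=4)
w_new = pc_asgd_step(w, rng.normal(size=4), rng.normal(size=4),
                     w - 0.1, lr=0.01, theta=0.5)

With theta = 1 the sketch keeps only the predicted (Taylor-corrected) delayed gradient, and with theta = 0 it discards the delayed gradient altogether, matching the abstract's description of a parameter that trades off the predicting and clipping steps.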

Cite

Text

Wang et al. "Asynchronous Training Schemes in Distributed Learning with Time Delay." Transactions on Machine Learning Research, 2024.

Markdown

[Wang et al. "Asynchronous Training Schemes in Distributed Learning with Time Delay." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/wang2024tmlr-asynchronous/)

BibTeX

@article{wang2024tmlr-asynchronous,
  title     = {{Asynchronous Training Schemes in Distributed Learning with Time Delay}},
  author    = {Wang, Haoxiang and Jiang, Zhanhong and Liu, Chao and Sarkar, Soumik and Jiang, Dongxiang and Lee, Young M.},
  journal   = {Transactions on Machine Learning Research},
  year      = {2024},
  url       = {https://mlanthology.org/tmlr/2024/wang2024tmlr-asynchronous/}
}