Distributed Adaptive Optimization with Divisible Communication
Abstract
Synchronous distributed training can scale the training of deep neural networks to large-scale data, so it has been widely adopted in large-scale applications. However, it often suffers from a communication bottleneck, and many methods have been proposed to reduce the communication cost. These communication reduction methods often lead to poor performance for adaptive optimizers, largely due to their non-linearity. To address this challenging issue, we propose a novel method to divide the communication into foreground and background communication. The foreground communication is more informative but can be made low-cost to achieve communication efficiency, while the background communication runs in the background and requires no synchronization time. We use Adam as the base optimizer and achieve a $\times 1024$ foreground compression ratio on CIFAR-10, $\times 128$ on non-iid CIFAR-10, $\times 64$ on the ImageNet image classification task, and $\times 128$ on the WMT'16 EN-DE machine translation task with comparable performance, which leads to $\times 7$, $\times 6.4$, $\times 3.5$, and $\times 7$ training speedup, respectively. Moreover, we provide rigorous theoretical analysis to prove that our method obtains the same convergence rate as Adam and achieves linear speedup with respect to the number of workers.
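The core idea in the abstract is to split each local update into a small, informative "foreground" part that is synchronized on the critical path and a "background" residual that is communicated asynchronously. The sketch below illustrates one plausible way to perform such a split using top-k selection by magnitude; the paper's exact compression scheme may differ, and `split_update`, `k`, and the top-k choice are assumptions for illustration only.

```python
import numpy as np

def split_update(update: np.ndarray, k: int):
    """Split a local update into a 'foreground' part (top-k entries by
    magnitude, synchronized immediately at low cost) and a 'background'
    residual (sent asynchronously, off the critical path).

    Illustrative sketch of the foreground/background division, not the
    paper's exact algorithm.
    """
    flat = update.ravel()
    # Indices of the k largest-magnitude entries: the informative part.
    idx = np.argsort(np.abs(flat))[-k:]
    foreground = np.zeros_like(flat)
    foreground[idx] = flat[idx]
    # Residual carries everything else; no synchronization time needed.
    background = flat - foreground
    return foreground.reshape(update.shape), background.reshape(update.shape)

update = np.array([[0.1, -2.0], [0.5, 3.0]])
fg, bg = split_update(update, k=2)
# Foreground keeps only the 2 largest-magnitude entries; the split is lossless:
assert np.count_nonzero(fg) == 2
assert np.allclose(fg + bg, update)
```

Because the foreground carries only `k` of the entries, a `k` much smaller than the parameter count yields compression ratios like the $\times 1024$ figure reported for CIFAR-10, while the background residual preserves the remaining information without adding synchronization time.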
Cite
Text
Xu and Bai. "Distributed Adaptive Optimization with Divisible Communication." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023. doi:10.1007/978-3-031-43418-1_39
Markdown
[Xu and Bai. "Distributed Adaptive Optimization with Divisible Communication." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023.](https://mlanthology.org/ecmlpkdd/2023/xu2023ecmlpkdd-distributed/) doi:10.1007/978-3-031-43418-1_39
BibTeX
@inproceedings{xu2023ecmlpkdd-distributed,
title = {{Distributed Adaptive Optimization with Divisible Communication}},
author = {Xu, An and Bai, Yang},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2023},
pages = {654-670},
doi = {10.1007/978-3-031-43418-1_39},
url = {https://mlanthology.org/ecmlpkdd/2023/xu2023ecmlpkdd-distributed/}
}