Fast Parallel Training of Neural Language Models

Abstract

Training neural language models (NLMs) is time-consuming, and parallelization is needed to speed it up. However, standard training methods scale poorly across multiple devices (e.g., GPUs) because of the large cost of transmitting data for gradient sharing during back-propagation. In this paper we present a sampling-based approach that reduces data transmission for better scaling of NLMs. As a "bonus", the resulting model also trains faster on a single device. Our approach yields significant speed improvements on a recurrent neural network-based language model. On four NVIDIA GTX 1080 GPUs, it achieves a speedup of more than 2.1 times over the standard asynchronous stochastic gradient descent baseline, with no increase in perplexity. This is 4.2 times faster than the naive single-GPU counterpart.
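The abstract describes reducing gradient-sharing traffic by sampling. As a rough illustration only (not the paper's actual algorithm), the general idea of sampling a gradient before transmission can be sketched as follows; the function name, the keep ratio, and the unbiased rescaling are all assumptions made for this sketch:

```python
import numpy as np

def sample_gradient(grad, keep_ratio=0.05, rng=None):
    """Illustrative sketch: keep a random subset of gradient entries.

    Only the sampled (index, value) pairs would be transmitted,
    cutting communication roughly by a factor of 1 / keep_ratio.
    This is a generic gradient-sampling sketch, not the method
    proposed in the paper.
    """
    rng = rng or np.random.default_rng(0)
    flat = grad.ravel()
    k = max(1, int(keep_ratio * flat.size))
    idx = rng.choice(flat.size, size=k, replace=False)
    sparse = np.zeros_like(flat)
    # Rescale kept entries so the sampled gradient is unbiased in expectation
    # (a common choice in sparsified communication; assumed here).
    sparse[idx] = flat[idx] / keep_ratio
    return sparse.reshape(grad.shape), idx

grad = np.random.default_rng(1).standard_normal((256, 128))
sparse_grad, sent_idx = sample_gradient(grad, keep_ratio=0.05)
print(sent_idx.size, grad.size)  # only ~5% of entries are transmitted
```

Under this sketch, each worker would send the sampled entries instead of the full dense gradient, which is where the communication savings during back-propagation would come from.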

Cite

Text

Xiao et al. "Fast Parallel Training of Neural Language Models." International Joint Conference on Artificial Intelligence, 2017. doi:10.24963/IJCAI.2017/586

Markdown

[Xiao et al. "Fast Parallel Training of Neural Language Models." International Joint Conference on Artificial Intelligence, 2017.](https://mlanthology.org/ijcai/2017/xiao2017ijcai-fast/) doi:10.24963/IJCAI.2017/586

BibTeX

@inproceedings{xiao2017ijcai-fast,
  title     = {{Fast Parallel Training of Neural Language Models}},
  author    = {Xiao, Tong and Zhu, Jingbo and Liu, Tongran and Zhang, Chunliang},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2017},
  pages     = {4193--4199},
  doi       = {10.24963/IJCAI.2017/586},
  url       = {https://mlanthology.org/ijcai/2017/xiao2017ijcai-fast/}
}