High-Performance Distributed ML at Scale Through Parameter Server Consistency Models

Abstract

As Machine Learning (ML) applications embrace greater data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Effective use of clusters for ML programs requires considerable expertise in writing distributed code, but existing highly-abstracted frameworks like Hadoop that pose low barriers to distributed programming have not, in practice, matched the performance seen in highly specialized and advanced ML implementations. The recent Parameter Server (PS) paradigm is a middle ground between these extremes, allowing easy conversion of single-machine parallel ML programs into distributed ones, while maintaining high throughput through relaxed "consistency models" that allow asynchronous (and, hence, inconsistent) parameter reads. However, due to insufficient theoretical study, it is not clear which of these consistency models can really ensure correct ML algorithm output; at the same time, there remain many theoretically-motivated but undiscovered opportunities to maximize computational throughput. Inspired by this challenge, we study both the theoretical guarantees and empirical behavior of iterative-convergent ML algorithms in existing PS consistency models. We then use the gleaned insights to improve a consistency model using an "eager" PS communication mechanism, and implement it as a new PS system that enables ML programs to reach their solution more quickly.
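To make the "relaxed consistency" idea concrete, the following is a minimal single-process sketch of a bounded-staleness read rule in the style of Stale Synchronous Parallel, one family of PS consistency models the paper studies. The class and method names (`SSPParamServer`, `read_allowed`, etc.) are illustrative assumptions, not the authors' actual system API.

```python
class SSPParamServer:
    """Toy bounded-staleness parameter server (illustrative, not the paper's system).

    Workers advance per-iteration clocks; a worker may read parameters only
    while it is at most `staleness` clocks ahead of the slowest worker, so
    reads can be asynchronous (and mildly inconsistent) within that bound.
    """

    def __init__(self, staleness, n_workers):
        self.staleness = staleness        # max allowed clock gap between workers
        self.clocks = [0] * n_workers     # per-worker iteration counters
        self.params = {}                  # parameter key -> value

    def clock(self, worker):
        """Worker signals completion of one iteration."""
        self.clocks[worker] += 1

    def read_allowed(self, worker):
        """True if this worker is within the staleness bound of the slowest."""
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def inc(self, key, delta):
        """Apply an additive update to a parameter."""
        self.params[key] = self.params.get(key, 0.0) + delta
```

For example, with `staleness=2` and two workers, a worker that has completed three iterations while the other has completed none must block (`read_allowed` is `False`) until the straggler catches up to within two clocks, which is the mechanism that bounds how inconsistent an asynchronous read can be.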

Cite

Text

Dai et al. "High-Performance Distributed ML at Scale Through Parameter Server Consistency Models." AAAI Conference on Artificial Intelligence, 2015. doi:10.1609/AAAI.V29I1.9195

Markdown

[Dai et al. "High-Performance Distributed ML at Scale Through Parameter Server Consistency Models." AAAI Conference on Artificial Intelligence, 2015.](https://mlanthology.org/aaai/2015/dai2015aaai-high/) doi:10.1609/AAAI.V29I1.9195

BibTeX

@inproceedings{dai2015aaai-high,
  title     = {{High-Performance Distributed ML at Scale Through Parameter Server Consistency Models}},
  author    = {Dai, Wei and Kumar, Abhimanu and Wei, Jinliang and Ho, Qirong and Gibson, Garth A. and Xing, Eric P.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2015},
  pages     = {79--87},
  doi       = {10.1609/AAAI.V29I1.9195},
  url       = {https://mlanthology.org/aaai/2015/dai2015aaai-high/}
}