Fast Distributed K-Means with a Small Number of Rounds

Abstract

We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines and a coordinator communicates with these machines to compute the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, 1-4 rounds indeed suffice. In comparison with the popular k-means$||$ algorithm, our approach can exploit a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means$||$, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means$||$.
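The objective compared across algorithms here is the standard k-means cost: the sum of squared distances from each point to its nearest center. As a minimal sketch of this objective (not the paper's distributed algorithm), the cost of a candidate center set can be computed as:

```python
import numpy as np

def kmeans_cost(points, centers):
    """Standard k-means cost: sum over points of the squared
    Euclidean distance to the nearest center."""
    # Pairwise squared distances, shape (n_points, n_centers),
    # via broadcasting.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep the distance to its closest center, then sum.
    return d2.min(axis=1).sum()

# Toy data: two tight clusters; centers placed at the cluster means.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])
print(kmeans_cost(pts, centers))  # 4 points, each 0.05 from its center
```

A c-approximation guarantee, as discussed in the abstract, means the returned centers achieve a cost at most c times the optimal cost for the same k.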

Cite

Text

Hess et al. "Fast Distributed K-Means with a Small Number of Rounds." Artificial Intelligence and Statistics, 2023.

Markdown

[Hess et al. "Fast Distributed K-Means with a Small Number of Rounds." Artificial Intelligence and Statistics, 2023.](https://mlanthology.org/aistats/2023/hess2023aistats-fast/)

BibTeX

@inproceedings{hess2023aistats-fast,
  title     = {{Fast Distributed K-Means with a Small Number of Rounds}},
  author    = {Hess, Tom and Visbord, Ron and Sabato, Sivan},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2023},
  pages     = {850--874},
  volume    = {206},
  url       = {https://mlanthology.org/aistats/2023/hess2023aistats-fast/}
}