A Framework for Parallelizing Hierarchical Clustering Methods

Abstract

Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k -means, k -median, and k -center and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k -means, k -median, and k -center as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.

Cite

Text

Lattanzi et al. "A Framework for Parallelizing Hierarchical Clustering Methods." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019. doi:10.1007/978-3-030-46150-8_5

Markdown

[Lattanzi et al. "A Framework for Parallelizing Hierarchical Clustering Methods." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019.](https://mlanthology.org/ecmlpkdd/2019/lattanzi2019ecmlpkdd-framework/) doi:10.1007/978-3-030-46150-8_5

BibTeX

@inproceedings{lattanzi2019ecmlpkdd-framework,
  title     = {{A Framework for Parallelizing Hierarchical Clustering Methods}},
  author    = {Lattanzi, Silvio and Lavastida, Thomas and Lu, Kefu and Moseley, Benjamin},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2019},
  pages     = {73-89},
  doi       = {10.1007/978-3-030-46150-8_5},
  url       = {https://mlanthology.org/ecmlpkdd/2019/lattanzi2019ecmlpkdd-framework/}
}