Parallel Correlation Clustering on Big Graphs

Abstract

Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, in practice KwikCluster requires a large number of clustering rounds, a potential bottleneck for large graphs.We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds, and provably achieve nearly linear speedups. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3 approximation ratio.We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.

Cite

Text

Pan et al. "Parallel Correlation Clustering on Big Graphs." Neural Information Processing Systems, 2015.

Markdown

[Pan et al. "Parallel Correlation Clustering on Big Graphs." Neural Information Processing Systems, 2015.](https://mlanthology.org/neurips/2015/pan2015neurips-parallel/)

BibTeX

@inproceedings{pan2015neurips-parallel,
  title     = {{Parallel Correlation Clustering on Big Graphs}},
  author    = {Pan, Xinghao and Papailiopoulos, Dimitris and Oymak, Samet and Recht, Benjamin and Ramchandran, Kannan and Jordan, Michael I},
  booktitle = {Neural Information Processing Systems},
  year      = {2015},
  pages     = {82-90},
  url       = {https://mlanthology.org/neurips/2015/pan2015neurips-parallel/}
}