Approximate K-Means++ in Sublinear Time

Abstract

The quality of k-means clustering is highly sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data, which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for k-means clustering. The main idea is to replace the exact D²-sampling step in k-means++ with a substantially faster approximation based on Markov chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, one can thus obtain a provably good clustering in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while offering a reduction in runtime of several orders of magnitude.
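The core idea from the abstract can be sketched as follows: instead of drawing each new center exactly from the D²-distribution (which requires a full pass over all n points), run a short Metropolis-Hastings chain with a uniform proposal, so each seeding step depends on the chain length rather than on n. This is a minimal illustrative sketch of that MCMC seeding idea, not the authors' reference implementation; the function name and parameters (e.g. `chain_length`) are chosen for illustration.

```python
import numpy as np

def mcmc_seeding(X, k, chain_length=200, rng=None):
    """Approximate k-means++ seeding via Metropolis-Hastings (sketch).

    Each exact D^2-sampling step is replaced by a Markov chain of
    `chain_length` uniform proposals, so the cost per center is
    O(chain_length * k * d) instead of O(n * d).
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]              # first center: uniform sample
    for _ in range(k - 1):
        C = np.asarray(centers)
        x = X[rng.integers(n)]                  # chain starts at a uniform sample
        dx = np.min(((C - x) ** 2).sum(axis=1))  # squared distance to nearest center
        for _ in range(chain_length):
            y = X[rng.integers(n)]              # uniform proposal
            dy = np.min(((C - y) ** 2).sum(axis=1))
            # accept with probability min(1, d^2(y, C) / d^2(x, C))
            if dx == 0 or dy / dx > rng.random():
                x, dx = y, dy
        centers.append(x)                       # final chain state is the new center
    return np.asarray(centers)
```

As the chain length grows, the stationary distribution of each chain approaches the exact D²-distribution used by k-means++, which is the intuition behind the paper's sublinear-time guarantee.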

Cite

Text

Bachem et al. "Approximate K-Means++ in Sublinear Time." AAAI Conference on Artificial Intelligence, 2016. doi:10.1609/AAAI.V30I1.10259


BibTeX

@inproceedings{bachem2016aaai-approximate,
  title     = {{Approximate K-Means++ in Sublinear Time}},
  author    = {Bachem, Olivier and Lucic, Mario and Hassani, S. Hamed and Krause, Andreas},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2016},
  pages     = {1459-1467},
  doi       = {10.1609/AAAI.V30I1.10259},
  url       = {https://mlanthology.org/aaai/2016/bachem2016aaai-approximate/}
}