Fast and Provably Good Seedings for K-Means

Abstract

Seeding - the task of finding initial cluster centers - is critical in obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state-of-the-art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data-generating distribution. We propose a simple yet fast seeding algorithm that produces *provably* good clusterings even *without assumptions* on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude. We validate our theoretical results in extensive experiments on a variety of real-world datasets.
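The core idea described above - replacing the k full D²-sampling passes of k-means++ with short Metropolis-Hastings chains driven by a precomputed proposal distribution - can be sketched as follows. This is a hypothetical illustration in the spirit of the paper, not the authors' reference implementation; the function name `mcmc_seeding`, the chain length `m`, and the half-D²/half-uniform proposal mixture are assumptions made for the sketch.

```python
import numpy as np

def mcmc_seeding(X, k, m=100, seed=None):
    """Sketch of MCMC-approximated k-means++ seeding (illustrative only).

    X    : (n, d) data matrix
    k    : number of cluster centers to pick
    m    : Markov chain length (trades solution quality for speed)
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # First center: sampled uniformly at random, as in k-means++.
    centers = [X[rng.integers(n)]]

    # Proposal distribution built in ONE pass over the data:
    # a mixture of D^2 w.r.t. the first center and the uniform distribution.
    # (The uniform term keeps every point reachable by the chain.)
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    q = 0.5 * d2 / d2.sum() + 0.5 / n

    for _ in range(1, k):
        # Squared distance of each point to its nearest chosen center.
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)

        # Metropolis-Hastings chain of length m targeting the D^2 distribution.
        x = rng.choice(n, p=q)
        for _ in range(m - 1):
            y = rng.choice(n, p=q)
            # Acceptance ratio for target dist(.) with proposal q(.).
            accept = (dist[y] * q[x]) / (dist[x] * q[y] + 1e-12)
            if rng.random() < accept:
                x = y
        centers.append(X[x])

    return np.array(centers)
```

Each chain touches only m points instead of all n, which is where the claimed speed-up over full k-means++ seeding comes from; larger m moves the sampled centers closer to exact D²-sampling.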

Cite

Text

Bachem et al. "Fast and Provably Good Seedings for K-Means." Neural Information Processing Systems, 2016.

Markdown

[Bachem et al. "Fast and Provably Good Seedings for K-Means." Neural Information Processing Systems, 2016.](https://mlanthology.org/neurips/2016/bachem2016neurips-fast/)

BibTeX

@inproceedings{bachem2016neurips-fast,
  title     = {{Fast and Provably Good Seedings for K-Means}},
  author    = {Bachem, Olivier and Lucic, Mario and Hassani, Hamed and Krause, Andreas},
  booktitle = {Neural Information Processing Systems},
  year      = {2016},
  pages     = {55--63},
  url       = {https://mlanthology.org/neurips/2016/bachem2016neurips-fast/}
}