Learning the K in K-Means
Abstract
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity.
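The abstract outlines the full loop: grow k hierarchically, and split any center whose assigned points fail a Gaussianity test along a one-dimensional projection. Below is a minimal sketch of that loop in Python, assuming scikit-learn's KMeans and SciPy's Anderson-Darling test (which, for dist='norm', applies a small-sample correction). The child-center initialization, the minimum cluster size, and the max_k cap are simplifying assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def should_split(points, alpha=0.05):
    """Test whether `points` look Gaussian along the axis connecting
    two 2-means child centers. Returns True when the Anderson-Darling
    test REJECTS normality, i.e. when the cluster should be split."""
    if len(points) < 8:                # too few points to test reliably
        return False
    two = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    v = two.cluster_centers_[0] - two.cluster_centers_[1]
    norm = np.linalg.norm(v)
    if norm == 0:
        return False
    proj = points @ (v / norm)          # project data onto the split axis
    proj = (proj - proj.mean()) / (proj.std() or 1.0)
    result = anderson(proj, dist='norm')
    # SciPy tabulates critical values at the 15, 10, 5, 2.5, 1 % levels;
    # pick the one closest to the requested alpha.
    levels = np.array([0.15, 0.10, 0.05, 0.025, 0.01])
    idx = int(np.argmin(np.abs(levels - alpha)))
    return result.statistic > result.critical_values[idx]

def g_means(X, alpha=0.05, max_k=50):
    """Grow k until every cluster's points are accepted as Gaussian."""
    centers = X.mean(axis=0, keepdims=True)
    while len(centers) < max_k:
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
        centers = km.cluster_centers_
        new_centers = []
        for j, c in enumerate(centers):
            pts = X[km.labels_ == j]
            if should_split(pts, alpha):
                # replace the rejected center with two 2-means children
                kids = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
                new_centers.extend(kids.cluster_centers_)
            else:
                new_centers.append(c)
        if len(new_centers) == len(centers):
            break                        # every cluster passed the test
        centers = np.array(new_centers)
    return centers
```

Note that SciPy's table only reaches the 1% significance level, so this sketch clamps α to that grid; smaller significance levels would need critical values beyond what scipy.stats.anderson provides.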
Cite

Text

Hamerly and Elkan. "Learning the K in K-Means." Neural Information Processing Systems, 2003.

Markdown

[Hamerly and Elkan. "Learning the K in K-Means." Neural Information Processing Systems, 2003.](https://mlanthology.org/neurips/2003/hamerly2003neurips-learning/)

BibTeX
@inproceedings{hamerly2003neurips-learning,
  title     = {{Learning the K in K-Means}},
  author    = {Hamerly, Greg and Elkan, Charles},
  booktitle = {Neural Information Processing Systems},
  year      = {2003},
  pages     = {281--288},
  url       = {https://mlanthology.org/neurips/2003/hamerly2003neurips-learning/}
}