Learning the K in K-Means
Abstract
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity.
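The abstract outlines the full loop: grow k hierarchically, and split any center whose assigned points fail a Gaussianity test along a one-dimensional projection. Below is a minimal sketch of that loop in Python, assuming scikit-learn's KMeans and SciPy's Anderson-Darling test (which, for dist='norm', applies a small-sample correction). The child-center initialization, the minimum cluster size, and the max_k cap are simplifying assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def should_split(points, alpha=0.05):
    """Test whether `points` look Gaussian along the axis connecting
    two 2-means child centers. Returns True when the Anderson-Darling
    test REJECTS normality, i.e. when the cluster should be split."""
    if len(points) < 8:                # too few points to test reliably
        return False
    two = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    v = two.cluster_centers_[0] - two.cluster_centers_[1]
    norm = np.linalg.norm(v)
    if norm == 0:
        return False
    proj = points @ (v / norm)          # project data onto the split axis
    proj = (proj - proj.mean()) / (proj.std() or 1.0)
    result = anderson(proj, dist='norm')
    # SciPy tabulates critical values at the 15, 10, 5, 2.5, 1 % levels;
    # pick the one closest to the requested alpha.
    levels = np.array([0.15, 0.10, 0.05, 0.025, 0.01])
    idx = int(np.argmin(np.abs(levels - alpha)))
    return result.statistic > result.critical_values[idx]

def g_means(X, alpha=0.05, max_k=50):
    """Grow k until every cluster's points are accepted as Gaussian."""
    centers = X.mean(axis=0, keepdims=True)
    while len(centers) < max_k:
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
        centers = km.cluster_centers_
        new_centers = []
        for j, c in enumerate(centers):
            pts = X[km.labels_ == j]
            if should_split(pts, alpha):
                # replace the rejected center with two 2-means children
                kids = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
                new_centers.extend(kids.cluster_centers_)
            else:
                new_centers.append(c)
        if len(new_centers) == len(centers):
            break                        # every cluster passed the test
        centers = np.array(new_centers)
    return centers
```

Note that SciPy's table only reaches the 1% significance level, so this sketch clamps α to that grid; smaller significance levels would need critical values beyond what scipy.stats.anderson provides.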
Cite

Text

Hamerly and Elkan. "Learning the K in K-Means." Neural Information Processing Systems, 2003.

Markdown

[Hamerly and Elkan. "Learning the K in K-Means." Neural Information Processing Systems, 2003.](https://mlanthology.org/neurips/2003/hamerly2003neurips-learning/)

BibTeX
@inproceedings{hamerly2003neurips-learning,
  title     = {{Learning the K in K-Means}},
  author    = {Hamerly, Greg and Elkan, Charles},
  booktitle = {Neural Information Processing Systems},
  year      = {2003},
  pages     = {281--288},
  url       = {https://mlanthology.org/neurips/2003/hamerly2003neurips-learning/}
}