A Beyond-Worst-Case Analysis of Greedy K-Means++

Abstract

$k$-means++ and the related greedy $k$-means++ algorithm are celebrated methods that efficiently compute seeds for Lloyd's algorithm. Greedy $k$-means++ generalizes $k$-means++: in each iteration, $\ell \geq 2$ candidate points are sampled and a new seed is chosen greedily among them, whereas $k$-means++ samples a single seed. While empirical studies consistently show that greedy $k$-means++ performs better, making it the preferred method in practice, a discrepancy exists between theory and practice: no theoretical justification currently explains this improved performance. Indeed, the prevailing theory suggests that greedy $k$-means++ performs worse than $k$-means++ in worst-case scenarios. This paper presents an analysis demonstrating that the greedy algorithm outperforms $k$-means++ on a natural class of well-separated instances drawn from exponentially decaying distributions, such as Gaussians, specifically when $\ell = \Theta(\log k)$, a parameter setting common in practice.
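
Below is a minimal, hedged sketch in Python/NumPy of the greedy seeding procedure described in the abstract, included purely for illustration. The function name `greedy_kmeanspp` and its interface are not from the paper, and the selection rule (keeping the sampled candidate that most reduces the current $k$-means cost) follows the standard textbook description of greedy $k$-means++ rather than any implementation released by the authors.

```python
import numpy as np

def greedy_kmeanspp(X, k, ell, seed=None):
    """Illustrative sketch of greedy k-means++ seeding (not the paper's code).

    X: (n, d) data array; k: number of seeds; ell: candidates sampled per round.
    Each round draws ell candidates via D^2 sampling (probability proportional to
    the squared distance to the nearest chosen seed) and keeps the candidate
    that yields the smallest resulting k-means cost.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                  # first seed: uniform at random
    d2 = np.sum((X - centers[0]) ** 2, axis=1)      # squared distance to nearest seed

    for _ in range(k - 1):
        probs = d2 / d2.sum()
        candidates = rng.choice(n, size=ell, p=probs)   # D^2 sampling of ell candidates
        best_idx, best_cost, best_d2 = None, np.inf, None
        for idx in candidates:
            new_d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
            cost = new_d2.sum()                     # k-means cost if this candidate is added
            if cost < best_cost:
                best_idx, best_cost, best_d2 = idx, cost, new_d2
        centers.append(X[best_idx])
        d2 = best_d2

    return np.array(centers)
```

With this hypothetical interface, the parameter regime highlighted in the abstract corresponds to choosing the number of candidates on the order of $\log k$, e.g. `greedy_kmeanspp(X, k, ell=max(2, int(np.log(k))))`; setting `ell=1` recovers plain $k$-means++ seeding.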

Cite

Text

Chen et al. "A Beyond-Worst-Case Analysis of Greedy K-Means++." Advances in Neural Information Processing Systems, 2025.

Markdown

[Chen et al. "A Beyond-Worst-Case Analysis of Greedy K-Means++." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/chen2025neurips-beyondworstcase/)

BibTeX

@inproceedings{chen2025neurips-beyondworstcase,
  title     = {{A Beyond-Worst-Case Analysis of Greedy K-Means++}},
  author    = {Chen, Qingyun and Im, Sungjin and Moseley, Benjamin and Milstrey, Ryan and Xu, Chenyang and Zhang, Ruilong},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/chen2025neurips-beyondworstcase/}
}