A Beyond-Worst-Case Analysis of Greedy K-Means++
Abstract
$k$-means++ and the related greedy $k$-means++ algorithm are celebrated algorithms that efficiently compute seeds for Lloyd's algorithm. Greedy $k$-means++ is a generalization of $k$-means++ where, in each iteration, a new seed is chosen greedily from among $\ell \geq 2$ sampled candidate points, as opposed to the single point sampled in $k$-means++. While empirical studies consistently show the superior performance of greedy $k$-means++, making it a preferred method in practice, a discrepancy exists between theory and practice: no theoretical justification currently explains this improved performance. Indeed, the prevailing theory suggests that greedy $k$-means++ exhibits worse performance than $k$-means++ in worst-case scenarios. This paper presents an analysis demonstrating that the greedy algorithm outperforms $k$-means++ for a natural class of well-separated instances with exponentially decaying distributions, such as Gaussians, specifically when $\ell = \Theta(\log k)$, a common parameter setting in practical applications.
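To make the seeding rule concrete, below is a minimal NumPy sketch of greedy $k$-means++ as described in the abstract; it is not the authors' code, and the function name and signature are illustrative. `X` is an $n \times d$ array of points, `ell = 1` recovers standard $k$-means++, and `ell` on the order of $\log k$ corresponds to the parameter regime analyzed in the paper.

```python
import numpy as np

def greedy_kmeans_pp(X, k, ell, seed=None):
    """Sketch of greedy k-means++ seeding (illustrative, not the paper's code).

    In each round, draw `ell` candidates by D^2 sampling and keep the one
    that most reduces the current k-means cost; `ell = 1` is plain k-means++.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                  # first seed: uniform at random
    d2 = np.sum((X - centers[0]) ** 2, axis=1)      # squared distance to nearest center

    for _ in range(k - 1):
        probs = d2 / d2.sum()
        candidates = rng.choice(n, size=ell, p=probs)   # D^2-sample ell candidate points
        best_idx, best_cost, best_d2 = None, np.inf, None
        for c in candidates:
            # Cost if candidate c were added as the next center.
            new_d2 = np.minimum(d2, np.sum((X - X[c]) ** 2, axis=1))
            cost = new_d2.sum()
            if cost < best_cost:
                best_idx, best_cost, best_d2 = c, cost, new_d2
        centers.append(X[best_idx])                 # greedy choice among the ell candidates
        d2 = best_d2

    return np.array(centers)

# Example usage with ell on the order of log k:
# X = np.random.default_rng(0).normal(size=(1000, 2))
# k = 10
# C = greedy_kmeans_pp(X, k, ell=int(np.log(k)) + 1, seed=0)
```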
Cite
Text
Chen et al. "A Beyond-Worst-Case Analysis of Greedy K-Means++." Advances in Neural Information Processing Systems, 2025.
Markdown
[Chen et al. "A Beyond-Worst-Case Analysis of Greedy K-Means++." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/chen2025neurips-beyondworstcase/)
BibTeX
@inproceedings{chen2025neurips-beyondworstcase,
  title     = {{A Beyond-Worst-Case Analysis of Greedy K-Means++}},
  author    = {Chen, Qingyun and Im, Sungjin and Moseley, Benjamin and Milstrey, Ryan and Xu, Chenyang and Zhang, Ruilong},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/chen2025neurips-beyondworstcase/}
}