Optimal Time Bounds for Approximate Clustering
Abstract
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just $O(k \log \frac{n}{k})$) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say $\frac{1}{100}$) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
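To make the k-median objective and the idea of a small summarizing sample concrete, here is a minimal Python sketch. The abstract does not specify the sampling procedure, so everything below is an illustrative assumption rather than the paper's algorithm: the function names, the constants `alpha` and `beta`, and the rule of discarding the `beta` fraction of remaining points closest to each round's sample are all hypothetical choices.

```python
import math
import random


def kmedian_cost(points, centers):
    """k-median objective: average distance from each point to its nearest center."""
    total = sum(min(math.dist(p, c) for c in centers) for p in points)
    return total / len(points)


def successive_sample(points, k, alpha=4, beta=0.5, seed=0):
    """Return indices of a small summary set, built in rounds.

    Assumed scheme (not taken from the paper): each round draws alpha*k
    points with replacement from the points still remaining, then discards
    the beta fraction of remaining points that lie closest to that sample.
    """
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    summary = set()
    batch = max(1, alpha * k)
    while len(remaining) > batch:
        sample = [rng.choice(remaining) for _ in range(batch)]
        summary.update(sample)
        # Sort remaining points by distance to their nearest sampled point.
        by_dist = sorted(
            (min(math.dist(points[i], points[s]) for s in sample), i)
            for i in remaining
        )
        # Drop the closest beta fraction (at least one point, so the loop ends).
        drop = max(1, math.ceil(beta * len(remaining)))
        remaining = [i for _, i in by_dist[drop:]]
    # The few leftover points join the summary directly.
    summary.update(remaining)
    return sorted(summary)
```

Under these assumptions the remaining set shrinks by a constant factor per round, so there are about log(n/k) rounds of about k samples each, and the distance computations form a geometric series dominated by the first round (roughly n times `alpha * k`). A usage example would be `successive_sample(data, k=5)` on a list of coordinate tuples, followed by `kmedian_cost(data, [data[i] for i in chosen])` for any `k` indices `chosen` picked from the summary.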
Cite
Text
Mettu and Plaxton. "Optimal Time Bounds for Approximate Clustering." Machine Learning, 2004. doi:10.1023/B:MACH.0000033114.18632.E0
Markdown
[Mettu and Plaxton. "Optimal Time Bounds for Approximate Clustering." Machine Learning, 2004.](https://mlanthology.org/mlj/2004/mettu2004mlj-optimal/) doi:10.1023/B:MACH.0000033114.18632.E0
BibTeX
@article{mettu2004mlj-optimal,
title = {{Optimal Time Bounds for Approximate Clustering}},
author = {Mettu, Ramgopal R. and Plaxton, C. Greg},
journal = {Machine Learning},
year = {2004},
pages = {35--60},
doi = {10.1023/B:MACH.0000033114.18632.E0},
volume = {56},
url = {https://mlanthology.org/mlj/2004/mettu2004mlj-optimal/}
}