A K-Median Algorithm with Running Time Independent of Data Size

Abstract

We give a sampling-based algorithm for the k -Median problem, with running time O ( k \((\frac{k^2 }{ \in } \log k)^2 \) log \((\frac{k}{ \in } \log k)\) ), where k is the desired number of clusters and ∈ is a confidence parameter. This is the first k -Median algorithm with fully polynomial running time that is independent of n , the size of the data set. It gives a solution that is, with high probability, an O (1)-approximation, if each cluster in some optimal solution has Ω \((\frac{{n \in }}k)\) points. We also give weakly-polynomial-time algorithms for this problem and a relaxed version of k -Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.

Cite

Text

Meyerson et al. "A K-Median Algorithm with Running Time Independent of Data Size." Machine Learning, 2004. doi:10.1023/B:MACH.0000033115.78247.F0

Markdown

[Meyerson et al. "A K-Median Algorithm with Running Time Independent of Data Size." Machine Learning, 2004.](https://mlanthology.org/mlj/2004/meyerson2004mlj-kmedian/) doi:10.1023/B:MACH.0000033115.78247.F0

BibTeX

@article{meyerson2004mlj-kmedian,
  title     = {{A K-Median Algorithm with Running Time Independent of Data Size}},
  author    = {Meyerson, Adam and O'Callaghan, Liadan and Plotkin, Serge A.},
  journal   = {Machine Learning},
  year      = {2004},
  pages     = {61-87},
  doi       = {10.1023/B:MACH.0000033115.78247.F0},
  volume    = {56},
  url       = {https://mlanthology.org/mlj/2004/meyerson2004mlj-kmedian/}
}