Explainable K-Means and K-Medians Clustering

Abstract

Many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and that any clustering based on a tree with k leaves must incur an Ω(log k) approximation factor compared to the optimal clustering. On the positive side, for two clusters (k = 2), we show that a single threshold cut achieves a constant-factor approximation, and we give nearly matching lower bounds; for general k > 2, we design an efficient algorithm that achieves an O(k) approximation to the optimal k-medians cost and an O(k^2) approximation to the optimal k-means cost. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
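The k = 2 case described in the abstract reduces explainability to a single axis-aligned threshold cut: the cluster assignment is fully determined by one feature and one split value. The sketch below illustrates that idea for the 2-medians objective; it is not the paper's algorithm or analysis, just a brute-force search over all candidate cuts (the function name and search strategy are ours), scoring each cut by the L1 cost to the coordinate-wise medians of the two sides.

```python
import numpy as np

def best_threshold_cut_2medians(X):
    """Search all axis-aligned threshold cuts (feature i, value theta)
    and return the one minimizing the 2-medians cost of the induced split.
    Illustrative brute force, not optimized. X is an (n, d) array."""
    n, d = X.shape
    best_feature, best_theta, best_cost = None, None, np.inf
    for i in range(d):
        xs = X[np.argsort(X[:, i])]  # rows sorted by feature i
        vals = xs[:, i]
        for j in range(n - 1):
            if vals[j] == vals[j + 1]:
                continue  # no threshold separates equal values
            theta = 0.5 * (vals[j] + vals[j + 1])
            left, right = xs[: j + 1], xs[j + 1 :]
            # 2-medians cost: sum of L1 distances of each point to the
            # coordinate-wise median of its side of the cut
            cost = (np.abs(left - np.median(left, axis=0)).sum()
                    + np.abs(right - np.median(right, axis=0)).sum())
            if cost < best_cost:
                best_feature, best_theta, best_cost = i, theta, cost
    return best_feature, best_theta, best_cost

# Example on two well-separated synthetic clusters:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
i, theta, cost = best_threshold_cut_2medians(X)
# Points with X[:, i] <= theta form one cluster, the rest the other.
```

The same search with squared Euclidean distances to coordinate-wise means gives a 2-means variant. For general k, the paper's approach instead builds a threshold tree with k leaves; the brute-force scoring above would not scale and is shown only to make the "single cut explains the clustering" idea concrete.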

Cite

Text

Moshkovitz et al. "Explainable K-Means and K-Medians Clustering." International Conference on Machine Learning, 2020.

Markdown

[Moshkovitz et al. "Explainable K-Means and K-Medians Clustering." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/moshkovitz2020icml-explainable/)

BibTeX

@inproceedings{moshkovitz2020icml-explainable,
  title     = {{Explainable K-Means and K-Medians Clustering}},
  author    = {Moshkovitz, Michal and Dasgupta, Sanjoy and Rashtchian, Cyrus and Frost, Nave},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {7055--7065},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/moshkovitz2020icml-explainable/}
}