DBSCAN++: Towards Fast and Scalable Density Clustering

Abstract

DBSCAN is a classical density-based clustering procedure with tremendous practical relevance. However, DBSCAN implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity that is too slow for large datasets. We propose DBSCAN++, a simple modification of DBSCAN that requires computing densities only for a chosen subset of points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness to the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a property that may be of independent interest.
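
To make the central idea concrete, the following is a minimal sketch of density clustering in which densities are computed only for a subset of points. It is an illustration under stated assumptions, not the authors' implementation: the abstract says only that the subset is "chosen", so uniform sampling is assumed here, and the remaining steps (linking core points within the bandwidth and assigning every other point to its nearest core point) follow ordinary DBSCAN conventions. The names `dbscan_pp_sketch`, `sample_fraction`, and `seed` are hypothetical, and distances are computed by brute force for clarity.

```python
import numpy as np


def dbscan_pp_sketch(X, eps, min_pts, sample_fraction, seed=0):
    """Illustrative subsampled density clustering. Returns labels; -1 is noise."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = max(1, int(round(sample_fraction * n)))
    # Assumption: the subset is drawn uniformly at random without replacement.
    subset = rng.choice(n, size=m, replace=False)

    # Densities are evaluated only for the m sampled points, against all of X:
    # an m x n distance matrix instead of the n x n matrix of plain DBSCAN.
    d_subset = np.linalg.norm(X[subset][:, None, :] - X[None, :, :], axis=2)
    # A sampled point is "core" if at least min_pts points (itself included)
    # lie within eps of it.
    is_core = (d_subset <= eps).sum(axis=1) >= min_pts
    core_pts = X[subset[is_core]]

    labels = np.full(n, -1)
    if len(core_pts) == 0:
        return labels

    # Link core points that are within eps of each other and label the
    # connected components with a depth-first search.
    d_core = np.linalg.norm(core_pts[:, None, :] - core_pts[None, :, :], axis=2)
    adjacent = d_core <= eps
    core_labels = np.full(len(core_pts), -1)
    next_label = 0
    for start in range(len(core_pts)):
        if core_labels[start] != -1:
            continue
        core_labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adjacent[i] & (core_labels == -1)):
                core_labels[j] = next_label
                stack.append(j)
        next_label += 1

    # Every point takes the label of its nearest core point if that core
    # point is within eps; otherwise it stays labelled as noise (-1).
    d_to_core = np.linalg.norm(X[:, None, :] - core_pts[None, :, :], axis=2)
    nearest = d_to_core.argmin(axis=1)
    within = d_to_core[np.arange(n), nearest] <= eps
    labels[within] = core_labels[nearest[within]]
    return labels
```

With `sample_fraction=1.0` the sketch evaluates the density at every point, as plain DBSCAN does; smaller fractions shrink the density computation from n × n to m × n distance evaluations, which is the cost saving the abstract describes.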

Cite

Text

Jang and Jiang. "DBSCAN++: Towards Fast and Scalable Density Clustering." International Conference on Machine Learning, 2019.

Markdown

[Jang and Jiang. "DBSCAN++: Towards Fast and Scalable Density Clustering." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/jang2019icml-dbscan/)

BibTeX

@inproceedings{jang2019icml-dbscan,
  title     = {{DBSCAN++: Towards Fast and Scalable Density Clustering}},
  author    = {Jang, Jennifer and Jiang, Heinrich},
  booktitle = {International Conference on Machine Learning},
  year      = {2019},
  pages     = {3019--3029},
  volume    = {97},
  url       = {https://mlanthology.org/icml/2019/jang2019icml-dbscan/}
}