Scalable DBSCAN with Random Projections

Abstract

We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. sDBSCAN leverages recent advancements in random projections given a significantly large number of random vectors to quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN preserves the DBSCAN’s clustering structure under mild conditions with high probability. To facilitate sDBSCAN, we present sOPTICS, a scalable visual tool to guide the parameter setting of sDBSCAN. We also extend sDBSCAN and sOPTICS to L2, L1, χ2, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than competitive DBSCAN variants on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while the scikit-learn counterparts and other clustering competitors demand several hours orcannot run on our hardware due to memory constraints. Our code is available at https://github.com/NinhPham/sDbscan.

Cite

Text

Xu and Pham. "Scalable DBSCAN with Random Projections." Neural Information Processing Systems, 2024. doi:10.52202/079017-0877

Markdown

[Xu and Pham. "Scalable DBSCAN with Random Projections." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/xu2024neurips-scalable/) doi:10.52202/079017-0877

BibTeX

@inproceedings{xu2024neurips-scalable,
  title     = {{Scalable DBSCAN with Random Projections}},
  author    = {Xu, HaoChuan and Pham, Ninh},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0877},
  url       = {https://mlanthology.org/neurips/2024/xu2024neurips-scalable/}
}