Pivot-Based Distributed K-Nearest Neighbor Mining

Abstract

k -nearest neighbor ( k NN) search is a fundamental data mining task critical to many data analytics methods. Yet no effective techniques to date scale k NN search to large datasets. In this work we present PkNN, an exact distributed method that by leveraging modern distributed architectures for the first time scales k NN search to billion point datasets. The key to the PkNN strategy is a multi-round kNN search that exploits pivot-based data partitioning at each stage. This includes an outlier-driven partition adjustment mechanism that effectively minimizes data duplication and achieves a balanced workload across the compute cluster. Aggressive data-driven bounds along with a tiered support assignment strategy ensure correctness while limiting computation costs. Our experimental study on multi-dimensional real-world data demonstrates that PkNN achieves significant speedup over the state-of-the-art and scales effectively in data cardinality. Code and data related to this chapter are available at: http://solar-10.wpi.edu/cakuhlman/PkNN .

Cite

Text

Kuhlman et al. "Pivot-Based Distributed K-Nearest Neighbor Mining." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017. doi:10.1007/978-3-319-71246-8_51

Markdown

[Kuhlman et al. "Pivot-Based Distributed K-Nearest Neighbor Mining." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.](https://mlanthology.org/ecmlpkdd/2017/kuhlman2017ecmlpkdd-pivotbased/) doi:10.1007/978-3-319-71246-8_51

BibTeX

@inproceedings{kuhlman2017ecmlpkdd-pivotbased,
  title     = {{Pivot-Based Distributed K-Nearest Neighbor Mining}},
  author    = {Kuhlman, Caitlin and Yan, Yizhou and Cao, Lei and Rundensteiner, Elke A.},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2017},
  pages     = {843-860},
  doi       = {10.1007/978-3-319-71246-8_51},
  url       = {https://mlanthology.org/ecmlpkdd/2017/kuhlman2017ecmlpkdd-pivotbased/}
}