Pivot-Based Distributed K-Nearest Neighbor Mining
Abstract
k -nearest neighbor ( k NN) search is a fundamental data mining task critical to many data analytics methods. Yet no effective techniques to date scale k NN search to large datasets. In this work we present PkNN, an exact distributed method that by leveraging modern distributed architectures for the first time scales k NN search to billion point datasets. The key to the PkNN strategy is a multi-round kNN search that exploits pivot-based data partitioning at each stage. This includes an outlier-driven partition adjustment mechanism that effectively minimizes data duplication and achieves a balanced workload across the compute cluster. Aggressive data-driven bounds along with a tiered support assignment strategy ensure correctness while limiting computation costs. Our experimental study on multi-dimensional real-world data demonstrates that PkNN achieves significant speedup over the state-of-the-art and scales effectively in data cardinality. Code and data related to this chapter are available at: http://solar-10.wpi.edu/cakuhlman/PkNN .
Cite
Text
Kuhlman et al. "Pivot-Based Distributed K-Nearest Neighbor Mining." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017. doi:10.1007/978-3-319-71246-8_51Markdown
[Kuhlman et al. "Pivot-Based Distributed K-Nearest Neighbor Mining." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.](https://mlanthology.org/ecmlpkdd/2017/kuhlman2017ecmlpkdd-pivotbased/) doi:10.1007/978-3-319-71246-8_51BibTeX
@inproceedings{kuhlman2017ecmlpkdd-pivotbased,
title = {{Pivot-Based Distributed K-Nearest Neighbor Mining}},
author = {Kuhlman, Caitlin and Yan, Yizhou and Cao, Lei and Rundensteiner, Elke A.},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2017},
pages = {843-860},
doi = {10.1007/978-3-319-71246-8_51},
url = {https://mlanthology.org/ecmlpkdd/2017/kuhlman2017ecmlpkdd-pivotbased/}
}