Stochastic Neighbor Compression

Abstract

We present Stochastic Neighbor Compression (SNC), an algorithm to compress a dataset for the purpose of k-nearest neighbor (kNN) classification. Given training data, SNC learns a much smaller synthetic dataset that minimizes the stochastic 1-nearest neighbor classification error on the training data. This approach has several appealing properties: due to its small size, the compressed set speeds up kNN testing drastically (up to several orders of magnitude, in our experiments); it makes the kNN classifier substantially more robust to label noise; on 4 of 7 datasets it yields lower test error than kNN on the entire training set, even at compression ratios as low as 2%; finally, the SNC compression leads to impressive speed-ups over kNN even when kNN and SNC are both used with ball-tree data structures, hashing, and LMNN dimensionality reduction, demonstrating that it is complementary to existing state-of-the-art algorithms for speeding up kNN classification and leads to substantial further improvements.
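To make the objective concrete: under the standard stochastic-neighbor formulation, each training point x_i selects a compressed point z_j as its neighbor with probability proportional to exp(-γ²‖x_i − z_j‖²), and SNC adjusts the z_j to maximize the probability that the selected neighbor carries the correct label. The paper optimizes this with conjugate gradient; the sketch below is a minimal NumPy illustration using plain gradient steps, with hypothetical parameter names and defaults (`gamma`, `lr`, `n_iters`), so it should be read as an approximation of the idea rather than the authors' implementation.

```python
import numpy as np

def snc_compress(X, y, m, gamma=1.0, lr=0.05, n_iters=300, seed=0):
    """Learn m synthetic points Z (with fixed labels yz) by minimizing the
    stochastic 1-NN negative log-likelihood of the training labels (X, y)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    Z, yz = X[idx].copy(), y[idx].copy()      # init from random training points
    same = (y[:, None] == yz[None, :]).astype(float)   # n x m label agreement
    eps = 1e-12                                        # guards divisions below

    for _ in range(n_iters):
        diff = X[:, None, :] - Z[None, :, :]           # n x m x d
        d2 = (diff ** 2).sum(-1)                       # squared distances, n x m
        # stochastic-neighbor weights, shifted row-wise for numerical stability
        w = np.exp(-(gamma ** 2) * (d2 - d2.min(axis=1, keepdims=True)))
        s = (w * same).sum(axis=1)                     # mass on same-label neighbors
        t = w.sum(axis=1)                              # total neighbor mass
        # analytic ascent direction on sum_i log(s_i / t_i)
        coef = 2 * gamma ** 2 * w * (same / (s[:, None] + eps) - 1.0 / (t[:, None] + eps))
        Z += lr * (coef[:, :, None] * diff).sum(axis=0)
    return Z, yz

# toy usage: two overlapping Gaussian blobs, compressed to 8 points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.repeat([0, 1], 200)
Z, yz = snc_compress(X, y, m=8)
# at test time, classify with plain 1-NN against the compressed set
pred = yz[((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1).argmin(axis=1)]
print("train accuracy with 8 compressed points:", (pred == y).mean())
```

The test-time speed-up in the abstract comes from the last step: 1-NN search runs against the m learned points instead of all n training points, so any accelerator (ball trees, hashing) operates on a far smaller set.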

Cite

Text

Kusner et al. "Stochastic Neighbor Compression." International Conference on Machine Learning, 2014.

Markdown

[Kusner et al. "Stochastic Neighbor Compression." International Conference on Machine Learning, 2014.](https://mlanthology.org/icml/2014/kusner2014icml-stochastic/)

BibTeX

@inproceedings{kusner2014icml-stochastic,
  title     = {{Stochastic Neighbor Compression}},
  author    = {Kusner, Matt and Tyree, Stephen and Weinberger, Kilian and Agrawal, Kunal},
  booktitle = {International Conference on Machine Learning},
  year      = {2014},
  pages     = {622--630},
  volume    = {32},
  url       = {https://mlanthology.org/icml/2014/kusner2014icml-stochastic/}
}