A Streaming Parallel Decision Tree Algorithm

Abstract

We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm's accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy.

Cite

Text

Ben-Haim and Tom-Tov. "A Streaming Parallel Decision Tree Algorithm." Journal of Machine Learning Research, 2010.

Markdown

[Ben-Haim and Tom-Tov. "A Streaming Parallel Decision Tree Algorithm." Journal of Machine Learning Research, 2010.](https://mlanthology.org/jmlr/2010/benhaim2010jmlr-streaming/)

BibTeX

@article{benhaim2010jmlr-streaming,
  title     = {{A Streaming Parallel Decision Tree Algorithm}},
  author    = {Ben-Haim, Yael and Tom-Tov, Elad},
  journal   = {Journal of Machine Learning Research},
  year      = {2010},
  pages     = {849-872},
  volume    = {11},
  url       = {https://mlanthology.org/jmlr/2010/benhaim2010jmlr-streaming/}
}