Efficient Distributed Decision Trees for Robust Regression
Abstract
The availability of massive volumes of data and recent advances in data collection and processing platforms have motivated the development of distributed machine learning algorithms. In numerous real-world applications, large datasets are inevitably noisy and contain outliers. These outliers can dramatically degrade the performance of standard machine learning approaches such as regression trees. To address this, we present a novel distributed regression tree approach that uses robust regression statistics, which are less sensitive to outliers, to handle large and noisy data. We propose to integrate error criteria based on robust statistics into the regression tree. A data summarization method is developed and used to improve the efficiency of learning regression trees in the distributed setting. We implemented the proposed approach and baselines on Apache Spark, a popular distributed data processing platform. Extensive experiments on both synthetic and real datasets verify the effectiveness and efficiency of our approach. The data and software related to this paper are available at https://github.com/weilai0980/DRSquare_tree/tree/master/.
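To illustrate why a robust error criterion matters when choosing a split, the following is a minimal single-machine sketch, not the paper's distributed algorithm or its data summarization method. It assumes least absolute deviation around the median as the robust criterion and compares it with the usual squared error around the mean; the function names `lad`, `sse`, and `best_split` and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import median


def lad(values):
    """Sum of absolute deviations from the median (assumed robust criterion)."""
    if not values:
        return 0.0
    m = median(values)
    return sum(abs(v - m) for v in values)


def sse(values):
    """Sum of squared errors around the mean (standard, outlier-sensitive)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, error=lad):
    """Return (threshold, total_error) of the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = error(left) + error(right)
        if total < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, total)
    return best


# Toy data: the target steps from 1 to 5 between x=4 and x=5,
# with a single outlier (100) at x=7.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, 1, 5, 5, 100, 5, 5]

robust_threshold, robust_error = best_split(xs, ys, error=lad)
squared_threshold, _ = best_split(xs, ys, error=sse)
print(robust_threshold, squared_threshold)
```

On this toy data the absolute-deviation criterion places the split at the true step in the target, while the squared-error criterion is pulled toward the outlier. This exhaustive scan recomputes the error for every candidate threshold, so it only demonstrates the criterion and says nothing about how to evaluate it efficiently in a distributed setting.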
Cite
Text
Guo et al. "Efficient Distributed Decision Trees for Robust Regression." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016. doi:10.1007/978-3-319-46227-1_6
Markdown
[Guo et al. "Efficient Distributed Decision Trees for Robust Regression." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.](https://mlanthology.org/ecmlpkdd/2016/guo2016ecmlpkdd-efficient/) doi:10.1007/978-3-319-46227-1_6
BibTeX
@inproceedings{guo2016ecmlpkdd-efficient,
title = {{Efficient Distributed Decision Trees for Robust Regression}},
author = {Guo, Tian and Kutzkov, Konstantin and Ahmed, Mohamed and Calbimonte, Jean-Paul and Aberer, Karl},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2016},
pages = {79-95},
doi = {10.1007/978-3-319-46227-1_6},
url = {https://mlanthology.org/ecmlpkdd/2016/guo2016ecmlpkdd-efficient/}
}