Massively Parallel Feature Selection: An Approach Based on Variance Preservation

Zhao, Zheng; Cox, James; Duling, David; Sarle, Warren

doi:10.1007/978-3-642-33460-3_21

ECML-PKDD 2012 pp. 237-252

Abstract

Advances in computer technologies have enabled corporations to accumulate data at an unprecedented speed. Large-scale business data might contain billions of observations and thousands of features, which easily brings their scale to the level of terabytes. Most traditional feature selection algorithms are designed for a centralized computing architecture, and their usability deteriorates significantly when data size exceeds hundreds of gigabytes. High-performance distributed computing frameworks and protocols, such as the Message Passing Interface (MPI) and MapReduce, have been proposed to facilitate software development on grid infrastructures, enabling analysts to process large-scale problems efficiently. This paper presents a novel large-scale feature selection algorithm that is based on variance analysis. The algorithm selects features by evaluating their ability to explain data variance. It supports both supervised and unsupervised feature selection and can be readily implemented in most distributed computing environments. The algorithm was developed as a SAS High-Performance Analytics procedure, which can read data in distributed form and perform parallel feature selection in both symmetric multiprocessing mode and massively parallel processing mode. Experimental results demonstrated the superior performance of the proposed method for large-scale feature selection.
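
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea it describes, not the authors' method or the SAS procedure. It assumes an unsupervised setting and a simple greedy criterion: each data partition computes additive sufficient statistics (row count, column sums, cross-products), the partitions are merged as an MPI-style reduce would do, and features are then picked one at a time by how much of the remaining variance of all features they explain. Every function name here (`partial_stats`, `merge_stats`, `covariance`, `greedy_select`) is hypothetical.

```python
from typing import List, Sequence, Tuple

Stats = Tuple[int, List[float], List[List[float]]]


def partial_stats(rows: Sequence[Sequence[float]]) -> Stats:
    """Sufficient statistics for one data partition: n, column sums, X^T X."""
    d = len(rows[0])
    sums = [0.0] * d
    cross = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            sums[i] += r[i]
            for j in range(d):
                cross[i][j] += r[i] * r[j]
    return len(rows), sums, cross


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Combine two partitions; the statistics are additive (a 'reduce' step)."""
    sums = [x + y for x, y in zip(a[1], b[1])]
    cross = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return a[0] + b[0], sums, cross


def covariance(stats: Stats) -> List[List[float]]:
    """Population covariance matrix recovered from the merged statistics."""
    n, sums, cross = stats
    d = len(sums)
    mean = [s / n for s in sums]
    return [[cross[i][j] / n - mean[i] * mean[j] for j in range(d)]
            for i in range(d)]


def greedy_select(cov: List[List[float]], k: int, eps: float = 1e-12) -> List[int]:
    """Assumed criterion: greedily pick the feature explaining the most
    residual variance, then remove what it explains from the residual."""
    d = len(cov)
    resid = [row[:] for row in cov]
    selected: List[int] = []
    for _ in range(k):
        best, best_gain = None, eps
        for f in range(d):
            if f in selected or resid[f][f] <= eps:
                continue
            # Variance of all features explained by regressing them on f.
            gain = sum(resid[j][f] ** 2 for j in range(d)) / resid[f][f]
            if gain > best_gain:
                best, best_gain = f, gain
        if best is None:  # nothing left to explain
            break
        selected.append(best)
        col = [resid[j][best] for j in range(d)]
        pivot = resid[best][best]
        for i in range(d):
            for j in range(d):
                resid[i][j] -= col[i] * col[j] / pivot
    return selected


# Toy usage: two partitions of a 4 x 3 data set where feature 2 = feature 0 + feature 1.
part_a = [[1.0, 1.0, 2.0], [2.0, -1.0, 1.0]]
part_b = [[3.0, -1.0, 2.0], [4.0, 1.0, 5.0]]
merged = merge_stats(partial_stats(part_a), partial_stats(part_b))
selected = greedy_select(covariance(merged), k=3)
print(selected)
```

Because only the merged statistics are needed after the reduce step, the selection loop itself never touches the raw rows; that is one plausible reason a variance-based criterion suits distributed settings, though the paper should be consulted for the actual formulation.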

Cite

Text

Zhao et al. "Massively Parallel Feature Selection: An Approach Based on Variance Preservation." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2012. doi:10.1007/978-3-642-33460-3_21

Markdown

[Zhao et al. "Massively Parallel Feature Selection: An Approach Based on Variance Preservation." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2012.](https://mlanthology.org/ecmlpkdd/2012/zhao2012ecmlpkdd-massively/) doi:10.1007/978-3-642-33460-3_21

BibTeX

@inproceedings{zhao2012ecmlpkdd-massively,
  title     = {{Massively Parallel Feature Selection: An Approach Based on Variance Preservation}},
  author    = {Zhao, Zheng and Cox, James and Duling, David and Sarle, Warren},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2012},
  pages     = {237--252},
  doi       = {10.1007/978-3-642-33460-3_21},
  url       = {https://mlanthology.org/ecmlpkdd/2012/zhao2012ecmlpkdd-massively/}
}