Single-Pass PCA of Large High-Dimensional Data
Abstract
Principal component analysis (PCA) is a fundamental dimension-reduction tool in statistics and machine learning. For large, high-dimensional data, computing the PCA (i.e., the singular vectors corresponding to a number of dominant singular values of the data matrix) becomes a challenging task. In this work, a single-pass randomized algorithm is proposed that computes the PCA with only one pass over the data. It is suitable for processing extremely large, high-dimensional data stored in slow memory (hard disk) or generated in a streaming fashion. Experiments with synthetic and real data validate the algorithm's accuracy: its error is orders of magnitude smaller than that of an existing single-pass algorithm. For a set of high-dimensional data stored as a 150 GB file, the proposed algorithm computes the first 50 principal components in just 24 minutes on a typical 24-core computer, using less than 1 GB of memory.
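The abstract summarizes rather than specifies the algorithm. As a rough illustration of how a single pass can suffice, the NumPy sketch below implements one standard single-pass randomized scheme from this literature (cf. Halko et al., 2011): it streams over row blocks of the data matrix A exactly once, accumulating both the range sketch Y = A*Omega and the product B = Y^T*A, then recovers an approximate truncated SVD from these small matrices. This is an illustrative sketch, not necessarily the paper's exact method; the function name single_pass_pca and all parameter choices are hypothetical.

import numpy as np

def single_pass_pca(row_blocks, n, k, oversample=10, seed=0):
    """Approximate top-k PCA of an (m x n) matrix A in one pass.

    row_blocks: iterable yielding row blocks of A, each of shape
    (b_i, n); every block is read exactly once. Assumes the data
    are already centered (column means removed).
    """
    rng = np.random.default_rng(seed)
    l = k + oversample                     # sketch width l = k + p
    Omega = rng.standard_normal((n, l))    # random test matrix

    Y_blocks = []                          # pieces of Y = A @ Omega
    B = np.zeros((l, n))                   # accumulates B = Y.T @ A
    for A_i in row_blocks:                 # the single pass over A
        Y_i = A_i @ Omega
        Y_blocks.append(Y_i)
        B += Y_i.T @ A_i                   # sum_i (A_i Omega)^T A_i

    Y = np.vstack(Y_blocks)                # m x l range sketch
    Q, R = np.linalg.qr(Y)                 # Y = Q R
    # B = Y.T A = R.T (Q.T A), so recover the small matrix C ~ Q.T A
    C = np.linalg.solve(R.T, B)            # assumes R is nonsingular
    U_c, s, Vt = np.linalg.svd(C, full_matrices=False)
    return Q @ U_c[:, :k], s[:k], Vt[:k]   # A ~ U diag(s) Vt

A hypothetical usage, reading a large on-disk array in chunks so that only one row block and the small sketch matrices reside in memory at a time (the file name and shapes are made up):

A = np.memmap("data.bin", dtype=np.float32, mode="r", shape=(1_000_000, 5_000))
blocks = (A[i:i + 10_000] for i in range(0, A.shape[0], 10_000))
U, s, Vt = single_pass_pca(blocks, n=5_000, k=50)

Note that PCA proper requires centering the columns; in a genuinely streaming setting the column sums can be accumulated in the same pass and the mean correction applied to the small sketch matrices afterward, though that refinement is omitted here.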
Cite
Text
Yu et al. "Single-Pass PCA of Large High-Dimensional Data." International Joint Conference on Artificial Intelligence, 2017. doi:10.24963/IJCAI.2017/468
Markdown
[Yu et al. "Single-Pass PCA of Large High-Dimensional Data." International Joint Conference on Artificial Intelligence, 2017.](https://mlanthology.org/ijcai/2017/yu2017ijcai-single/) doi:10.24963/IJCAI.2017/468
BibTeX
@inproceedings{yu2017ijcai-single,
title = {{Single-Pass PCA of Large High-Dimensional Data}},
author = {Yu, Wenjian and Gu, Yu and Li, Jian and Liu, Shenghua and Li, Yaohang},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2017},
pages = {3350--3356},
doi = {10.24963/IJCAI.2017/468},
url = {https://mlanthology.org/ijcai/2017/yu2017ijcai-single/}
}