Improved Distributed Principal Component Analysis
Abstract
We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low-dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as $k$-means clustering and low-rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real-world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of the techniques we develop, such as input-sparsity subspace embeddings with high correctness probability with a dimension and sparsity independent of the error probability, may be of independent interest.
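To make the setting concrete, the following is a minimal sketch of one generic way several servers could agree on a shared low-dimensional subspace: each server sends a truncated SVD summary of its own points, and a coordinator runs an SVD on the stacked summaries. It is an illustration under stated assumptions, not the algorithm or the subspace-embedding techniques developed in the paper. The function names (`local_summary`, `distributed_pca`, `projection_cost`) and the truncation parameter `t` are hypothetical, and the points are assumed to be centered already.

```python
import numpy as np


def local_summary(points, t):
    """Server side: compress an (n_i, d) point set to at most t rows.

    Returns Sigma_t @ V_t^T from the local SVD, so only t * d numbers
    need to be communicated instead of n_i * d.
    """
    _, s, vt = np.linalg.svd(points, full_matrices=False)
    t = min(t, len(s))
    return s[:t, None] * vt[:t]


def distributed_pca(local_point_sets, k, t):
    """Coordinator side: stack the summaries and take the top-k directions.

    Returns a (d, k) matrix whose orthonormal columns span the subspace.
    """
    stacked = np.vstack([local_summary(p, t) for p in local_point_sets])
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    return vt[:k].T


def projection_cost(points, basis):
    """Sum of squared distances from the points to the given subspace."""
    residual = points - points @ basis @ basis.T
    return float(np.sum(residual ** 2))
```

In this sketch `t` controls the trade-off the abstract refers to: a smaller `t` lowers communication, while a larger `t` keeps the summaries closer to the original data. When `t` equals the dimension `d`, the stacked summaries have the same Gram matrix as the union of the point sets, so the recovered subspace has the same projection cost as a centralized SVD.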
Cite
Text
Liang et al. "Improved Distributed Principal Component Analysis." Neural Information Processing Systems, 2014.
Markdown
[Liang et al. "Improved Distributed Principal Component Analysis." Neural Information Processing Systems, 2014.](https://mlanthology.org/neurips/2014/liang2014neurips-improved/)
BibTeX
@inproceedings{liang2014neurips-improved,
title = {{Improved Distributed Principal Component Analysis}},
author = {Liang, Yingyu and Balcan, Maria-Florina F and Kanchanapally, Vandana and Woodruff, David},
booktitle = {Neural Information Processing Systems},
year = {2014},
pages = {3113-3121},
url = {https://mlanthology.org/neurips/2014/liang2014neurips-improved/}
}