Coresets for Multiple $\ell_p$ Regression

ICML 2024 pp. 53202-53233

/icml/2024/woodruff2024icml-coresets/

Abstract

A coreset of a dataset with $n$ examples and $d$ features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and $\ell_p$ linear regression with a single response are known in prior work. However, for multiple $\ell_p$ regression where there can be $m$ responses, there are no known constructions with size sublinear in $m$. In this work, we construct coresets of size $\tilde O(\varepsilon^{-2}d)$ for $p<2$ and $\tilde O(\varepsilon^{-p}d^{p/2})$ for $p>2$ independently of $m$ (i.e., dimension-free) that approximate the multiple $\ell_p$ regression objective at every point in the domain up to $(1\pm\varepsilon)$ relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an $\varepsilon$ factor for all $p>1$. All of our bounds are nearly tight. We give two application of our results. First, we settle the number of uniform samples needed to approximate $\ell_p$ Euclidean power means up to a $(1+\varepsilon)$ factor, showing that $\tilde\Theta(\varepsilon^{-2})$ samples for $p = 1$, $\tilde\Theta(\varepsilon^{-1})$ samples for $1 < p < 2$, and $\tilde\Theta(\varepsilon^{1-p})$ samples for $p>2$ is tight, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for $1<p<2$, every matrix has a subset of $\tilde O(\varepsilon^{-1}k)$ rows which spans a $(1+\varepsilon)$-approximately optimal $k$-dimensional subspace for $\ell_p$ subspace approximation, which is also nearly optimal.

PDF ICML OpenReview Semantic Scholar

Cite

Text

Woodruff and Yasuda. "Coresets for Multiple $\ell_p$ Regression." International Conference on Machine Learning, 2024.

Markdown

[Woodruff and Yasuda. "Coresets for Multiple $\ell_p$ Regression." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/woodruff2024icml-coresets/)

BibTeX

@inproceedings{woodruff2024icml-coresets,
  title     = {{Coresets for Multiple $\ell_p$ Regression}},
  author    = {Woodruff, David and Yasuda, Taisuke},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {53202-53233},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/woodruff2024icml-coresets/}
}