Linear Regression Using Heterogeneous Data Batches

Ayush Jain, Rajat Sen, Weihao Kong, Abhimanyu Das, Alon Orlitsky

NeurIPS 2024

doi:10.52202/079017-2763

Abstract

In many learning applications, data are collected from multiple sources, each providing a batch of samples that by itself is insufficient to learn its input-output relationship. A common approach assumes that the sources fall into one of several unknown subgroups, each with an unknown input distribution and input-output relationship. We consider one of this setup's most fundamental and important manifestations, where the output is a noisy linear combination of the inputs and there are $k$ subgroups, each with its own regression vector. Prior work [KSS$^+$20] showed that, given abundant small batches, the regression vectors can be learned with only a few, $\tilde\Omega(k^{3/2})$, medium-size batches of $\tilde\Omega(\sqrt{k})$ samples each. However, that paper requires the input distribution of all $k$ subgroups to be isotropic Gaussian, and states that removing this assumption is an "interesting and challenging problem". We propose a novel gradient-based algorithm that improves on the existing results in several ways. It extends the applicability of the algorithm by: (1) allowing the subgroups' underlying input distributions to be different, unknown, and heavy-tailed; (2) recovering all subgroups followed by a significant proportion of batches even for infinite $k$; (3) removing the separation requirement between the regression vectors; (4) reducing the number of batches and allowing smaller batch sizes.
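The data model described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not the paper's algorithm: the names `make_batches` and `pooled_least_squares`, the Gaussian inputs with per-subgroup scales, the noise level, and all sizes are assumptions chosen for the example. It shows why a single small batch is insufficient on its own, and how much an oracle that knows the hidden subgroup labels would gain by pooling batches.

```python
import numpy as np

def make_batches(num_batches, batch_size, d, k, noise_std=0.1, seed=0):
    """Simulate batches drawn from k subgroups (hypothetical setup).

    Each subgroup has its own regression vector and its own input scales,
    so the input distributions differ across subgroups.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, d))                    # one regression vector per subgroup
    scales = rng.uniform(0.5, 2.0, size=(k, d))    # assumed non-isotropic input scales
    labels = rng.integers(0, k, size=num_batches)  # hidden subgroup of each batch
    batches = []
    for g in labels:
        X = rng.normal(size=(batch_size, d)) * scales[g]
        y = X @ W[g] + noise_std * rng.normal(size=batch_size)  # noisy linear output
        batches.append((X, y))
    return batches, labels, W

def pooled_least_squares(batches, idx):
    """Ordinary least squares on the samples of the batches listed in idx."""
    X = np.vstack([batches[i][0] for i in idx])
    y = np.concatenate([batches[i][1] for i in idx])
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

# Batch size 4 is far below the dimension d = 20, so one batch is underdetermined.
batches, labels, W = make_batches(num_batches=400, batch_size=4, d=20, k=3)
g = labels[0]

single = pooled_least_squares(batches, [0])
# Oracle step: uses the true labels, which a real learner does not observe.
pooled = pooled_least_squares(batches, np.flatnonzero(labels == g))

err_single = np.linalg.norm(single - W[g])
err_pooled = np.linalg.norm(pooled - W[g])
print(f"single-batch error: {err_single:.3f}, oracle-pooled error: {err_pooled:.3f}")
```

The oracle pooling step is exactly what is unavailable in the problem the paper studies, since the subgroup of each batch is unknown. The example only illustrates the gap that a method for this setting has to close.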


Cite

Text

Jain et al. "Linear Regression Using Heterogeneous Data Batches." Neural Information Processing Systems, 2024. doi:10.52202/079017-2763

Markdown

[Jain et al. "Linear Regression Using Heterogeneous Data Batches." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/jain2024neurips-linear/) doi:10.52202/079017-2763

BibTeX

@inproceedings{jain2024neurips-linear,
  title     = {{Linear Regression Using Heterogeneous Data Batches}},
  author    = {Jain, Ayush and Sen, Rajat and Kong, Weihao and Das, Abhimanyu and Orlitsky, Alon},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2763},
  url       = {https://mlanthology.org/neurips/2024/jain2024neurips-linear/}
}