Linear Regression Using Heterogeneous Data Batches
Abstract
In many learning applications, data are collected from multiple sources, each providing a \emph{batch} of samples that by itself is insufficient to learn its input-output relationship. A common approach assumes that the sources fall in one of several unknown subgroups, each with an unknown input distribution and input-output relationship. We consider one of this setup's most fundamental and important manifestations, where the output is a noisy linear combination of the inputs and there are $k$ subgroups, each with its own regression vector. Prior work [KSS$^+$20] showed that with abundant small batches, the regression vectors can be learned with only a few, $\tilde\Omega(k^{3/2})$, medium-size batches with $\tilde\Omega(\sqrt k)$ samples each. However, the paper requires that the input distribution for all $k$ subgroups be isotropic Gaussian, and states that removing this assumption is an ``interesting and challenging problem''. We propose a novel gradient-based algorithm that improves on the existing results in several ways. It extends the applicability of the algorithm by: (1) allowing the subgroups' underlying input distributions to be different, unknown, and heavy-tailed; (2) recovering all subgroups that are followed by a significant proportion of batches, even for infinite $k$; (3) removing the separation requirement between the regression vectors; (4) reducing the number of batches and allowing smaller batch sizes.
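The data model described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm; it only simulates the setup, in which every batch comes from one hidden subgroup and each output is a noisy linear combination of the inputs. The function name `make_batches`, its parameters, and the standard Gaussian inputs are assumptions made for illustration (the paper itself allows different, unknown, and heavy-tailed input distributions).

```python
import random


def make_batches(regression_vectors, num_batches, batch_size, noise_std=0.1, seed=0):
    """Simulate heterogeneous batches (hypothetical helper, not from the paper).

    Each batch draws all of its samples from a single hidden subgroup,
    and each output is y = <w, x> + noise for that subgroup's vector w.
    """
    rng = random.Random(seed)
    d = len(regression_vectors[0])
    batches = []
    for _ in range(num_batches):
        # Hidden subgroup label; a learner would not observe this.
        group = rng.randrange(len(regression_vectors))
        w = regression_vectors[group]
        samples = []
        for _ in range(batch_size):
            # Assumption: standard Gaussian inputs, chosen only for simplicity.
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]
            y = sum(wi * xi for wi, xi in zip(w, x)) + rng.gauss(0.0, noise_std)
            samples.append((x, y))
        batches.append((group, samples))
    return batches


# Two subgroups in dimension 3; each batch has only 2 samples, so no
# single batch can determine its own regression vector.
vectors = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
batches = make_batches(vectors, num_batches=6, batch_size=2)
```

With a batch size below the input dimension, a single batch leaves its regression vector underdetermined, which is the sense in which a batch "by itself is insufficient" and information must be pooled across batches.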
Cite
Text
Jain et al. "Linear Regression Using Heterogeneous Data Batches." Neural Information Processing Systems, 2024. doi:10.52202/079017-2763
Markdown
[Jain et al. "Linear Regression Using Heterogeneous Data Batches." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/jain2024neurips-linear/) doi:10.52202/079017-2763
BibTeX
@inproceedings{jain2024neurips-linear,
title = {{Linear Regression Using Heterogeneous Data Batches}},
author = {Jain, Ayush and Sen, Rajat and Kong, Weihao and Das, Abhimanyu and Orlitsky, Alon},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2763},
url = {https://mlanthology.org/neurips/2024/jain2024neurips-linear/}
}