Learning and Data Selection in Big Datasets

Abstract

Finding a dataset of minimal cardinality that characterizes the optimal parameters of a model is of paramount importance in machine learning and in distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping and the most representative samples of the dataset (the sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset grows sub-linearly with the original dataset size. Numerical evaluations on real datasets reveal substantial compressibility, up to 95%, without a noticeable drop in learnability performance as measured by the generalization error.
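The idea of a sufficient dataset can be illustrated with a simple greedy selection loop: fit a model on a growing subset and, at each step, keep the sample whose inclusion most improves the fit on the full dataset. This is a minimal sketch under assumed choices (a closed-form ridge regressor, full-set MSE as a proxy for generalization error, and the hypothetical helper `greedy_sufficient_subset`), not the joint learning framework of the paper itself.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def greedy_sufficient_subset(X, y, budget, lam=1e-2):
    """Greedily grow a subset S: at each step, add the sample whose
    inclusion gives the model with the lowest error on the full set
    (an illustrative stand-in for the paper's joint selection)."""
    n = X.shape[0]
    selected, remaining = [], set(range(n))
    for _ in range(budget):
        best_i, best_err = None, np.inf
        for i in remaining:
            idx = selected + [i]
            w = fit_ridge(X[idx], y[idx], lam)
            err = mse(w, X, y)  # proxy for the generalization error
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
        remaining.remove(best_i)
    return selected, fit_ridge(X[selected], y[selected], lam)

# Toy usage: compress a noisy linear dataset to 10% of its samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
S, w = greedy_sufficient_subset(X, y, budget=20)
print(f"kept {len(S)}/{len(X)} samples, full-set MSE = {mse(w, X, y):.4f}")
```

On data of this kind, the greedily chosen subset typically matches the full-data fit closely, mirroring the abstract's observation that the sufficient dataset can be far smaller than the original.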

Cite

Text

Ghadikolaei et al. "Learning and Data Selection in Big Datasets." International Conference on Machine Learning, 2019.

Markdown

[Ghadikolaei et al. "Learning and Data Selection in Big Datasets." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/ghadikolaei2019icml-learning/)

BibTeX

@inproceedings{ghadikolaei2019icml-learning,
  title     = {{Learning and Data Selection in Big Datasets}},
  author    = {Ghadikolaei, Hossein Shokri and Ghauch, Hadi and Fischione, Carlo and Skoglund, Mikael},
  booktitle = {International Conference on Machine Learning},
  year      = {2019},
  pages     = {2191--2200},
  volume    = {97},
  url       = {https://mlanthology.org/icml/2019/ghadikolaei2019icml-learning/}
}