Automatic Discovery of the Statistical Types of Variables in a Dataset

Valera, Isabel; Ghahramani, Zoubin

ICML 2017 pp. 3521-3529

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Cite

Text

Valera and Ghahramani. "Automatic Discovery of the Statistical Types of Variables in a Dataset." International Conference on Machine Learning, 2017.

Markdown

[Valera and Ghahramani. "Automatic Discovery of the Statistical Types of Variables in a Dataset." International Conference on Machine Learning, 2017.](https://mlanthology.org/icml/2017/valera2017icml-automatic/)

BibTeX

@inproceedings{valera2017icml-automatic,
  title     = {{Automatic Discovery of the Statistical Types of Variables in a Dataset}},
  author    = {Valera, Isabel and Ghahramani, Zoubin},
  booktitle = {International Conference on Machine Learning},
  year      = {2017},
  pages     = {3521--3529},
  volume    = {70},
  url       = {https://mlanthology.org/icml/2017/valera2017icml-automatic/}
}