Automatic Discovery of the Statistical Types of Variables in a Dataset

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Cite

Text

Valera and Ghahramani. "Automatic Discovery of the Statistical Types of Variables in a Dataset." International Conference on Machine Learning, 2017.

Markdown

[Valera and Ghahramani. "Automatic Discovery of the Statistical Types of Variables in a Dataset." International Conference on Machine Learning, 2017.](https://mlanthology.org/icml/2017/valera2017icml-automatic/)

BibTeX

@inproceedings{valera2017icml-automatic,
  title     = {{Automatic Discovery of the Statistical Types of Variables in a Dataset}},
  author    = {Valera, Isabel and Ghahramani, Zoubin},
  booktitle = {International Conference on Machine Learning},
  year      = {2017},
  pages     = {3521--3529},
  volume    = {70},
  url       = {https://mlanthology.org/icml/2017/valera2017icml-automatic/}
}