Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge

Abstract

Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling, and feature manipulation in particular, still resist automation, partly because the problem strongly depends on domain information. For instance, if the strings “17th of August of 2017” and “2017-08-17” are to be formatted into “08/17/2017” so that a data analytics tool recognises them properly, humans usually process this in two steps: (1) they recognise that the strings are dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms to manipulate dates are very different from those to manipulate addresses. Covering many domains therefore requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with a dynamic background knowledge (BK) fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives for automating data format transformation and evaluate them on an integrated data wrangling benchmark, which we share publicly with the community together with the code.
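The two-step process in the abstract's date example (recognise the domain, then apply domain-specific conversions) can be illustrated with a minimal Python sketch. This is not the authors' system: the hand-written `detect_domain` rule only stands in for the learned meta-model, the fixed `to_us_date` function stands in for a program that inductive programming would synthesise from examples, and all names (`detect_domain`, `to_us_date`, `transform`, `DATE_FORMATS`) are hypothetical. It also assumes English month names under the default locale.

```python
import re
from datetime import datetime

# Hypothetical date-domain background knowledge: the input layouts we know.
DATE_FORMATS = ["%Y-%m-%d", "%d of %B of %Y"]

MONTHS = (r"\b(?:january|february|march|april|may|june|july|august|"
          r"september|october|november|december)\b")


def detect_domain(text):
    """Step 1: toy stand-in for the meta-model that selects the domain."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "date"
    if re.search(MONTHS, text.lower()):
        return "date"
    return "unknown"


def strip_ordinal(text):
    """Date primitive: turn '17th' into '17' so the day can be parsed."""
    return re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text)


def to_us_date(text):
    """Step 2: apply date-specific primitives to produce MM/DD/YYYY."""
    cleaned = strip_ordinal(text)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue  # try the next known layout
    raise ValueError(f"no known date layout matches {text!r}")


def transform(text):
    """Route the input to domain-specific primitives; else leave it as is."""
    if detect_domain(text) == "date":
        return to_us_date(text)
    return text


if __name__ == "__main__":
    for example in ["17th of August of 2017", "2017-08-17"]:
        print(example, "->", transform(example))
```

Selecting the domain first keeps the set of candidate primitives small, which is the motivation the abstract gives for a dynamic BK; an address string, for example, would never be offered the date primitives.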

Cite

Text

Ochando et al. "Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019. doi:10.1007/978-3-030-46133-1_44

Markdown

[Ochando et al. "Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019.](https://mlanthology.org/ecmlpkdd/2019/ochando2019ecmlpkdd-automated/) doi:10.1007/978-3-030-46133-1_44

BibTeX

@inproceedings{ochando2019ecmlpkdd-automated,
  title     = {{Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge}},
  author    = {Ochando, Lidia Contreras and Ferri, Cèsar and Hernández-Orallo, José and Martínez-Plumed, Fernando and Ramírez-Quintana, María José and Katayama, Susumu},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2019},
  pages     = {735-751},
  doi       = {10.1007/978-3-030-46133-1_44},
  url       = {https://mlanthology.org/ecmlpkdd/2019/ochando2019ecmlpkdd-automated/}
}