Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
Abstract
Machine learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes outlines 1) the importance of accurately retrieving join candidates, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, autoML, and learning in data lakes.
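The first two steps named in the abstract, retrieving joinable tables and merging their information, can be illustrated with a small sketch. This is not the paper's implementation: the function names (`containment`, `retrieve`, `merge`), the set-containment score, the 0.5 threshold, the mean aggregation, and the toy tables are all assumptions chosen for illustration, using only the Python standard library. The third step, prediction, would train a learner on the augmented rows and is not shown.

```python
from collections import defaultdict
from statistics import mean


def containment(base_keys, cand_keys):
    """Fraction of distinct base keys that also appear in the candidate column."""
    base = set(base_keys)
    return len(base & set(cand_keys)) / len(base) if base else 0.0


def retrieve(base, key, lake, threshold=0.5):
    """Return (table_name, column, score) for columns whose score meets the threshold."""
    base_keys = [row[key] for row in base]
    hits = []
    for name, table in lake.items():
        columns = list(table[0].keys()) if table else []
        for col in columns:
            score = containment(base_keys, [row[col] for row in table])
            if score >= threshold:
                hits.append((name, col, score))
    # Best-scoring candidates first.
    return sorted(hits, key=lambda h: -h[2])


def merge(base, key, table, cand_key, prefix):
    """Left join; numeric values from matching rows are aggregated by their mean."""
    groups = defaultdict(list)
    for row in table:
        groups[row[cand_key]].append(row)
    value_cols = [c for c in table[0] if c != cand_key]
    out = []
    for row in base:
        new = dict(row)
        matches = groups.get(row[key], [])
        for c in value_cols:
            vals = [m[c] for m in matches if isinstance(m[c], (int, float))]
            # Unmatched base rows keep a missing value, as in a left join.
            new[f"{prefix}.{c}"] = mean(vals) if vals else None
        out.append(new)
    return out


# Hypothetical base table and data lake.
base = [
    {"city": "Paris", "target": 1},
    {"city": "Rome", "target": 0},
    {"city": "Oslo", "target": 1},
]
lake = {
    "population": [
        {"town": "Paris", "pop": 2.1},
        {"town": "Rome", "pop": 2.8},
        {"town": "Rome", "pop": 3.0},
    ],
    "unrelated": [{"id": "x1", "val": 3}],
}

candidates = retrieve(base, "city", lake)
augmented = base
for name, col, _ in candidates:
    augmented = merge(augmented, "city", lake[name], col, name)
```

Here only the `town` column of the hypothetical `population` table overlaps enough with `city` to be retrieved, and the two `Rome` rows are averaged into a single feature so that the base table keeps one row per sample.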
Cite
Text
Cappuzzo et al. "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes." Transactions on Machine Learning Research, 2025.
Markdown
[Cappuzzo et al. "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/cappuzzo2025tmlr-retrieve/)
BibTeX
@article{cappuzzo2025tmlr-retrieve,
title = {{Retrieve, Merge, Predict: Augmenting Tables with Data Lakes}},
author = {Cappuzzo, Riccardo and Coelho, Aimee and Lefebvre, Félix and Papotti, Paolo and Varoquaux, Gaël},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/cappuzzo2025tmlr-retrieve/}
}