Sub-Merge: Diving Down to the Attribute-Value Level in Statistical Schema Matching
Abstract
Matching and merging data from conflicting sources is the bread and butter of data integration, which drives search verticals, e-commerce comparison sites and cyber intelligence. Schema matching lifts data integration, traditionally focused on well-structured data, to highly heterogeneous sources. While schema matching has enjoyed significant success in matching data attributes, inconsistencies can exist at a deeper level, making full integration difficult or impossible. We propose a more fine-grained approach that focuses on correspondences between the values of attributes across data sources. Since the semantics of attribute values derive from their use and co-occurrence, we argue for the suitability of canonical correlation analysis (CCA) and its variants. We demonstrate the superior statistical and computational performance of multiple sparse CCA compared to a suite of baseline algorithms, on two datasets that we are releasing to stimulate further research. Our crowd-annotated data covers both cases for which humans can supply ground truth relatively easily and cases that are inherently difficult for human computation.
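To make the idea of matching attribute values through co-occurrence concrete, here is a minimal sketch using plain regularized CCA in NumPy. It is not the multiple sparse CCA evaluated in the paper, and nothing in it comes from the authors' code or datasets: the two toy sources, their value names, the helper functions (`one_hot`, `inv_sqrt`, `cca_value_scores`) and the ridge term `reg` are all assumptions made for illustration. It also assumes the entities in the two sources are already aligned row by row.

```python
import numpy as np

def one_hot(labels, values):
    # Indicator matrix: one row per entity, one column per attribute value.
    return np.array([[1.0 if l == v else 0.0 for v in values] for l in labels])

def inv_sqrt(S):
    # Inverse square root of a symmetric positive-definite matrix.
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def cca_value_scores(X, Y, k, reg=1e-3):
    # Regularized CCA via SVD of the whitened cross-covariance.
    # `reg` is a hypothetical ridge term; centred one-hot columns are
    # linearly dependent, so the covariances are singular without it.
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, rho, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :k]    # loadings of source-A values on the top-k components
    B = Wy @ Vt[:k].T    # loadings of source-B values on the top-k components
    # Score each value pair by its correlation-weighted loading agreement.
    return (A * rho[:k]) @ B.T, rho[:k]

# Hypothetical toy data: nine aligned entities described by two sources
# that use different vocabularies for the same underlying attribute.
values_a = ["sci-fi", "romance", "horror"]
values_b = ["Love", "Scary", "SF"]
truth = {"sci-fi": "SF", "romance": "Love", "horror": "Scary"}
labels_a = values_a * 3
labels_b = [truth[a] for a in labels_a]

X = one_hot(labels_a, values_a)
Y = one_hot(labels_b, values_b)
scores, rho = cca_value_scores(X, Y, k=2)

# Match each source-A value to the source-B value with the highest score.
matches = {a: values_b[int(np.argmax(scores[i]))] for i, a in enumerate(values_a)}
print(matches)
```

In this noise-free toy case the highest-scoring pair for each value recovers the intended correspondence. Real sources would involve many attributes, noisy co-occurrence and far more values, which is where the sparse variants discussed in the abstract become relevant.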
Cite
Text
Lim and Rubinstein. "Sub-Merge: Diving Down to the Attribute-Value Level in Statistical Schema Matching." AAAI Conference on Artificial Intelligence, 2015. doi:10.1609/AAAI.V29I1.9459
Markdown
[Lim and Rubinstein. "Sub-Merge: Diving Down to the Attribute-Value Level in Statistical Schema Matching." AAAI Conference on Artificial Intelligence, 2015.](https://mlanthology.org/aaai/2015/lim2015aaai-sub/) doi:10.1609/AAAI.V29I1.9459
BibTeX
@inproceedings{lim2015aaai-sub,
title = {{Sub-Merge: Diving Down to the Attribute-Value Level in Statistical Schema Matching}},
author = {Lim, Zhe and Rubinstein, Benjamin I. P.},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2015},
pages = {1791-1797},
doi = {10.1609/AAAI.V29I1.9459},
url = {https://mlanthology.org/aaai/2015/lim2015aaai-sub/}
}