Leveraging Unlabeled Data to Scale Blocking for Record Linkage
Abstract
Record linkage is the process of matching records across two or more data sets that represent the same real-world entity. An exhaustive record linkage process computes the similarity between every pair of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and comparing only records within the same block. To adapt from domain to domain, one category of blocking techniques formalizes the construction of a blocking scheme as a machine learning problem. When learning the best blocking scheme, previous learning-based techniques use only a set of labeled data. However, because the labeled set is usually not large enough to characterize the unseen (unlabeled) data well, the resulting blocking scheme may perform poorly on the unseen data by generating too many candidate matches. To address this, we propose using unlabeled data, in addition to labeled data, for learning blocking schemes. Our experimental results show that using unlabeled data in learning can remarkably reduce the number of candidate matches while keeping the same level of coverage for true matches.
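To make the idea of blocking concrete, the following is a minimal sketch in Python, not the method proposed in the paper. It assumes a hand-written blocking key (the first three letters of a surname), toy records, and a toy set of true matches, all of which are hypothetical. It then measures the two quantities the abstract trades off: how many candidate pairs are avoided, and how many true matches are still covered.

```python
from collections import defaultdict
from itertools import product


def block_key(record):
    # Hypothetical blocking key (an assumption for illustration):
    # the first three letters of the lowercased surname.
    return record["surname"][:3].lower()


def candidate_pairs(left, right, key=block_key):
    """Return the set of (left_id, right_id) pairs that share a block."""
    blocks = defaultdict(lambda: ([], []))
    for rec in left:
        blocks[key(rec)][0].append(rec["id"])
    for rec in right:
        blocks[key(rec)][1].append(rec["id"])
    pairs = set()
    for left_ids, right_ids in blocks.values():
        # Only records inside the same block are compared.
        pairs.update(product(left_ids, right_ids))
    return pairs


# Toy data, invented for this example.
left = [
    {"id": "a1", "surname": "Johnson"},
    {"id": "a2", "surname": "Smith"},
    {"id": "a3", "surname": "Smyth"},
]
right = [
    {"id": "b1", "surname": "Johnson"},
    {"id": "b2", "surname": "Smith"},
    {"id": "b3", "surname": "Jones"},
]
true_matches = {("a1", "b1"), ("a2", "b2")}

pairs = candidate_pairs(left, right)
total_pairs = len(left) * len(right)

# Fraction of the exhaustive comparisons that blocking avoids.
reduction_ratio = 1 - len(pairs) / total_pairs
# Fraction of true matches that survive blocking (coverage).
pair_completeness = len(pairs & true_matches) / len(true_matches)
```

In this toy case the key keeps both true matches while cutting the nine exhaustive comparisons down to two. A learned blocking scheme would choose the key (or a combination of keys) from data rather than fixing it by hand.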
Cite
Text
Cao et al. "Leveraging Unlabeled Data to Scale Blocking for Record Linkage." International Joint Conference on Artificial Intelligence, 2011. doi:10.5591/978-1-57735-516-8/IJCAI11-369
Markdown
[Cao et al. "Leveraging Unlabeled Data to Scale Blocking for Record Linkage." International Joint Conference on Artificial Intelligence, 2011.](https://mlanthology.org/ijcai/2011/cao2011ijcai-leveraging/) doi:10.5591/978-1-57735-516-8/IJCAI11-369
BibTeX
@inproceedings{cao2011ijcai-leveraging,
title = {{Leveraging Unlabeled Data to Scale Blocking for Record Linkage}},
author = {Cao, Yunbo and Chen, Zhiyuan and Zhu, Jiamin and Yue, Pei and Lin, Chin-Yew and Yu, Yong},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2011},
pages = {2211-2217},
doi = {10.5591/978-1-57735-516-8/IJCAI11-369},
url = {https://mlanthology.org/ijcai/2011/cao2011ijcai-leveraging/}
}