BlockBoost: Scalable and Efficient Blocking Through Boosting

Abstract

As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.

Cite

Text

Ramos et al. "BlockBoost: Scalable and Efficient Blocking Through Boosting." Artificial Intelligence and Statistics, 2024.

Markdown

[Ramos et al. "BlockBoost: Scalable and Efficient Blocking Through Boosting." Artificial Intelligence and Statistics, 2024.](https://mlanthology.org/aistats/2024/ramos2024aistats-blockboost/)

BibTeX

@inproceedings{ramos2024aistats-blockboost,
  title     = {{BlockBoost: Scalable and Efficient Blocking Through Boosting}},
  author    = {Ramos, Thiago and Loro Schuller, Rodrigo and Akira Okuno, Alex and Nissenbaum, Lucas and Oliveira, Roberto I and Orenstein, Paulo},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2024},
  pages     = {2575-2583},
  volume    = {238},
  url       = {https://mlanthology.org/aistats/2024/ramos2024aistats-blockboost/}
}