SPADE: A Semi-Supervised Probabilistic Approach for Detecting Errors in Tables
Abstract
Error detection is one of the most important steps in data cleaning and usually requires extensive human interaction to ensure quality. Existing supervised methods in error detection require a significant amount of training data while unsupervised methods rely on fixed inductive biases, which are usually hard to generalize, to solve the problem. In this paper, we present SPADE, a novel semi-supervised probabilistic approach for error detection. SPADE introduces a novel probabilistic active learning model, where the system suggests examples to be labeled based on the agreements between user labels and indicative signals, which are designed to capture potential errors. SPADE uses a two-phase data augmentation process to enrich a dataset before training a deep learning classifier to detect unlabeled errors. In our evaluation, SPADE achieves an average F1-score of 0.91 over five datasets and yields a 10% improvement compared with the state-of-the-art systems.
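For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-score could be computed for error detection, assuming errors are identified as individual table cells. The function name `f1_score_cells` and the toy cell coordinates are illustrative assumptions and are not taken from the paper.

```python
def f1_score_cells(predicted, actual):
    """Return the F1-score between two collections of (row, column) cells.

    `predicted` holds the cells a detector flagged as erroneous;
    `actual` holds the cells that are truly erroneous.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        # No correct detections: precision and recall are both zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three true errors, three flagged cells, two in common.
actual_errors = {(0, "city"), (2, "zip"), (5, "state")}
flagged_cells = {(0, "city"), (2, "zip"), (7, "name")}
score = f1_score_cells(flagged_cells, actual_errors)
```

In this toy case precision and recall are both 2/3, so the F1-score is also 2/3.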
Cite
Text
Pham et al. "SPADE: A Semi-Supervised Probabilistic Approach for Detecting Errors in Tables." International Joint Conference on Artificial Intelligence, 2021. doi:10.24963/IJCAI.2021/488
Markdown
[Pham et al. "SPADE: A Semi-Supervised Probabilistic Approach for Detecting Errors in Tables." International Joint Conference on Artificial Intelligence, 2021.](https://mlanthology.org/ijcai/2021/pham2021ijcai-spade/) doi:10.24963/IJCAI.2021/488
BibTeX
@inproceedings{pham2021ijcai-spade,
title = {{SPADE: A Semi-Supervised Probabilistic Approach for Detecting Errors in Tables}},
author = {Pham, Minh and Knoblock, Craig A. and Chen, Muhao and Vu, Binh and Pujara, Jay},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2021},
pages = {3543-3551},
doi = {10.24963/IJCAI.2021/488},
url = {https://mlanthology.org/ijcai/2021/pham2021ijcai-spade/}
}