Similarity Encoding for Learning with Dirty Categorical Variables

Abstract

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately as feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with very high cardinality but also redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in predictive performance in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinalities, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
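To make the idea concrete, the following is a minimal sketch of similarity encoding, not the authors' implementation. It assumes that the similarity between two strings is the Jaccard index of their character 3-gram sets, one plausible reading of "3-gram similarity" that the abstract does not spell out. The function names (`char_ngrams`, `ngram_similarity`, `similarity_encode`), the space padding, the lowercasing, and the example categories are all hypothetical choices made for illustration.

```python
from typing import List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of `text`.

    Assumption: the string is lowercased and padded with one space on
    each side so that word boundaries contribute n-grams too.
    """
    padded = f" {text.lower()} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard index of the two n-gram sets, a value in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def similarity_encode(
    values: List[str], prototypes: List[str], n: int = 3
) -> List[List[float]]:
    """Encode each value as its vector of similarities to the prototypes.

    One-hot encoding corresponds to replacing the similarity with an
    exact-match indicator (1 if value == prototype, else 0).
    """
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical reference categories and dirty entries (typo, extra space).
prototypes = ["police officer", "firefighter", "teacher"]
dirty_values = ["police officer", "police oficer", "fire fighter"]

encoded = similarity_encode(dirty_values, prototypes)
for value, row in zip(dirty_values, encoded):
    print(f"{value!r}: {[round(x, 2) for x in row]}")
```

Under these assumptions, a misspelled entry such as "police oficer" stays close to "police officer" in the encoded space, whereas one-hot encoding would treat it as an unrelated category. Passing only a subset of the observed categories as `prototypes` is a simple way to limit the output dimension, in the spirit of the prototype selection mentioned above.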

Cite

Text

Cerda et al. "Similarity Encoding for Learning with Dirty Categorical Variables." Machine Learning, 2018. doi:10.1007/S10994-018-5724-2

Markdown

[Cerda et al. "Similarity Encoding for Learning with Dirty Categorical Variables." Machine Learning, 2018.](https://mlanthology.org/mlj/2018/cerda2018mlj-similarity/) doi:10.1007/S10994-018-5724-2

BibTeX

@article{cerda2018mlj-similarity,
  title     = {{Similarity Encoding for Learning with Dirty Categorical Variables}},
  author    = {Cerda, Patricio and Varoquaux, Gaël and Kégl, Balázs},
  journal   = {Machine Learning},
  year      = {2018},
  pages     = {1477-1494},
  doi       = {10.1007/S10994-018-5724-2},
  volume    = {107},
  url       = {https://mlanthology.org/mlj/2018/cerda2018mlj-similarity/}
}