Exploiting Extremely Rare Features in Text Categorization

Schönhofen, Péter; Benczúr, András A.

doi:10.1007/11871842_77

ECML-PKDD 2006 pp. 759-766

Abstract

One of the first steps of document classification, clustering and many other information retrieval tasks is to discard words occurring only a few times in the corpus, based on the assumption that they have little contribution to the bag of words representation. However, as we will show, rare n-grams and other similar features are able to indicate surprisingly well if two documents belong to the same category, and thus can aid classification. In our experiments over four corpora, we found that while keeping the size of the training set constant, 5-25% of the test set can be classified essentially for free based on rare features without any loss of accuracy, even experiencing an improvement of 0.6-1.6%.
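The idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it only assumes that a word n-gram occurring in very few documents links those documents, and that a test document can inherit the label of the training documents it shares such n-grams with. The function names, the `max_df` threshold, and the majority-vote rule are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def classify_by_rare_features(train, test, n=2, max_df=2):
    """Label test documents using only rare n-grams shared with training documents.

    train: list of (text, label) pairs; test: list of texts.
    An n-gram counts as "rare" here if it occurs in at least 2 and at most
    max_df documents of the combined corpus (an assumed threshold).
    Returns one label per test document, or None when no rare n-gram
    links it to the training set or the vote is tied.
    """
    train_feats = [word_ngrams(text, n) for text, _ in train]
    test_feats = [word_ngrams(text, n) for text in test]

    # Document frequency of every n-gram over training and test documents.
    df = Counter()
    for feats in train_feats + test_feats:
        df.update(feats)

    # Map each rare n-gram to the labels of the training documents containing it.
    rare_to_labels = defaultdict(list)
    for feats, (_, label) in zip(train_feats, train):
        for f in feats:
            if 2 <= df[f] <= max_df:
                rare_to_labels[f].append(label)

    predictions = []
    for feats in test_feats:
        votes = Counter()
        for f in feats:
            votes.update(rare_to_labels.get(f, []))
        ranked = votes.most_common(2)
        if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
            predictions.append(None)  # no evidence or a tie: leave to a regular classifier
        else:
            predictions.append(ranked[0][0])
    return predictions
```

Documents that come back as `None` would be handed to an ordinary classifier; only the documents linked by rare n-grams are labeled by this shortcut.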

Cite

Text

Schönhofen and Benczúr. "Exploiting Extremely Rare Features in Text Categorization." European Conference on Machine Learning, 2006. doi:10.1007/11871842_77

Markdown

[Schönhofen and Benczúr. "Exploiting Extremely Rare Features in Text Categorization." European Conference on Machine Learning, 2006.](https://mlanthology.org/ecmlpkdd/2006/schonhofen2006ecml-exploiting/) doi:10.1007/11871842_77

BibTeX

@inproceedings{schonhofen2006ecml-exploiting,
  title     = {{Exploiting Extremely Rare Features in Text Categorization}},
  author    = {Schönhofen, Péter and Benczúr, András A.},
  booktitle = {European Conference on Machine Learning},
  year      = {2006},
  pages     = {759-766},
  doi       = {10.1007/11871842_77},
  url       = {https://mlanthology.org/ecmlpkdd/2006/schonhofen2006ecml-exploiting/}
}