A Comparative Study on Feature Selection in Text Categorization

Abstract

This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary reduction but is not competitive at higher vocabulary reduction levels. In contrast, MI had relatively poor performance due to its bias towards favoring rare terms, and its sensitivity to probability estimation errors.
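
To make two of the criteria named in the abstract concrete, here is a minimal sketch of DF thresholding and a per-category CHI score. It is an illustration only, not the authors' implementation: the function names, the toy documents, and the labels are all hypothetical, and the CHI score uses the standard 2x2 contingency-table form of the chi-square statistic, which is assumed here rather than taken from the abstract.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so repeats within a document count once
    return df


def select_by_df(docs, min_df):
    """DF thresholding: keep terms that occur in at least min_df documents."""
    return {term for term, count in document_frequency(docs).items() if count >= min_df}


def chi_square(docs, labels, term, category):
    """Chi-square statistic for one (term, category) pair.

    Assumed standard 2x2 form: N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)).
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: term or category never varies
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical toy corpus, invented for illustration (not Reuters data).
docs = [
    ["wheat", "grain", "export"],
    ["wheat", "price", "export"],
    ["oil", "price", "barrel"],
    ["oil", "export", "barrel"],
]
labels = ["grain", "grain", "crude", "crude"]

kept_terms = select_by_df(docs, min_df=2)
wheat_score = chi_square(docs, labels, "wheat", "grain")
price_score = chi_square(docs, labels, "price", "grain")
```

In this toy corpus, "wheat" appears only in documents of the "grain" category and therefore scores higher than "price", which is spread evenly across both categories. DF thresholding needs only one counting pass and no labels, which is consistent with the abstract's description of it as the simplest and cheapest of the five methods.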

Cite

Text

Yang and Pedersen. "A Comparative Study on Feature Selection in Text Categorization." International Conference on Machine Learning, 1997.

Markdown

[Yang and Pedersen. "A Comparative Study on Feature Selection in Text Categorization." International Conference on Machine Learning, 1997.](https://mlanthology.org/icml/1997/yang1997icml-comparative/)

BibTeX

@inproceedings{yang1997icml-comparative,
  title     = {{A Comparative Study on Feature Selection in Text Categorization}},
  author    = {Yang, Yiming and Pedersen, Jan O.},
  booktitle = {International Conference on Machine Learning},
  year      = {1997},
  pages     = {412-420},
  url       = {https://mlanthology.org/icml/1997/yang1997icml-comparative/}
}