Feature Selection in SVM Text Categorization

Abstract

This paper investigates the effect of prior feature selection in Support Vector Machine (SVM) text categorization. The input space was gradually increased by using mutual information (MI) filtering and part-of-speech (POS) filtering, which determine the portion of words that are appropriate for learning from the information-theoretic and the linguistic perspectives, respectively. We tested the two filtering methods on SVMs as well as a decision tree algorithm C4.5. The SVMs' results common to both filterings are that 1) the optimal number of features differed completely across categories, and 2) the average performance for all categories was best when all of the words were used. In addition, a comparison of the two filtering methods clarified that POS filtering on SVMs consistently outperformed MI filtering, which indicates that SVMs cannot find irrelevant parts of speech. These results suggest a simple strategy for SVM text categorization: use the full set of words found through a rough filtering technique like part-of-speech tagging.
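To illustrate the MI filtering the abstract refers to, here is a minimal sketch (not the authors' implementation) of ranking words by the mutual information between a term's presence in a document and the document's category label; the toy corpus and helper names are invented for illustration:

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """I(T; C) between presence/absence of `term` and the category label.

    `docs` is a list of word sets, `labels` the matching category labels.
    """
    n = len(docs)
    joint = Counter()  # counts of (term present?, category) pairs
    for doc, c in zip(docs, labels):
        joint[(int(term in doc), c)] += 1
    p_t, p_c = Counter(), Counter()  # marginal counts
    for (t, c), cnt in joint.items():
        p_t[t] += cnt
        p_c[c] += cnt
    # I(T;C) = sum_{t,c} p(t,c) * log2( p(t,c) / (p(t) p(c)) )
    return sum(
        (cnt / n) * math.log2(cnt * n / (p_t[t] * p_c[c]))
        for (t, c), cnt in joint.items()
    )

# Toy corpus: "ball" perfectly predicts the sports category.
docs = [{"ball", "game"}, {"ball", "team"}, {"market", "stock"}, {"stock", "trade"}]
labels = ["sports", "sports", "finance", "finance"]
vocab = {"ball", "game", "team", "market", "stock", "trade"}

# MI filtering keeps only the top-k words by this score.
ranked = sorted(vocab, key=lambda t: mutual_information(docs, labels, t), reverse=True)
```

Under MI filtering, training would then proceed on the top-k words of `ranked` for increasing k; the paper's finding is that on SVMs no such cutoff beat simply using all words.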

Cite

Text

Taira and Haruno. "Feature Selection in SVM Text Categorization." AAAI Conference on Artificial Intelligence, 1999.

Markdown

[Taira and Haruno. "Feature Selection in SVM Text Categorization." AAAI Conference on Artificial Intelligence, 1999.](https://mlanthology.org/aaai/1999/taira1999aaai-feature/)

BibTeX

@inproceedings{taira1999aaai-feature,
  title     = {{Feature Selection in SVM Text Categorization}},
  author    = {Taira, Hirotoshi and Haruno, Masahiko},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {1999},
  pages     = {480--486},
  url       = {https://mlanthology.org/aaai/1999/taira1999aaai-feature/}
}