Feature Engineering for Text Classification
Abstract
Most research in text classification has used the “bag of words” representation of text. This paper examines some alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms and hypernyms). We describe the new representations and try to justify our suspicions that they could have improved the performance of a rule-based learner. The representations are evaluated using the RIPPER rule-based learner on the Reuters-21578 and DigiTrad test corpora, but on their own the new representations are not found to produce a significant performance improvement. Finally, we try combining classifiers based on different representations using a majority voting technique. This step does produce some performance improvement on both test collections. In general, our work supports the emerging consensus in the information retrieval community that more sophisticated Natural Language Processing techniques need to be developed before better text representations can be produced. We conclude that for now, research into new learning algorithms and methods for combining existing learners holds the most promise.
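The majority voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the voting scheme, so this is only one plausible reading: each classifier (one per text representation) casts a vote per document, and the most frequent label wins. The function name `majority_vote` and the prediction lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier predictions by majority vote.

    predictions: a list with one entry per classifier; each entry is a
    list of labels, one per document, all of the same length.
    Returns one combined label per document.
    """
    combined = []
    # zip(*predictions) groups the votes cast for the same document.
    for votes in zip(*predictions):
        # most_common(1) returns [(label, count)] for the top label.
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical binary outputs (1 = document assigned to the category)
# from classifiers trained on three different representations.
words_preds = [1, 0, 1, 0]
phrases_preds = [1, 1, 0, 0]
hypernym_preds = [0, 1, 1, 0]

combined = majority_vote([words_preds, phrases_preds, hypernym_preds])
print(combined)
```

Using an odd number of voters on a binary decision, as assumed here, avoids ties; a different tie-breaking rule would be needed otherwise.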
Cite
Text
Scott and Matwin. "Feature Engineering for Text Classification." International Conference on Machine Learning, 1999.
Markdown
[Scott and Matwin. "Feature Engineering for Text Classification." International Conference on Machine Learning, 1999.](https://mlanthology.org/icml/1999/scott1999icml-feature/)
BibTeX
@inproceedings{scott1999icml-feature,
title = {{Feature Engineering for Text Classification}},
author = {Scott, Sam and Matwin, Stan},
booktitle = {International Conference on Machine Learning},
year = {1999},
pages = {379-388},
url = {https://mlanthology.org/icml/1999/scott1999icml-feature/}
}