A Pitfall and Solution in Multi-Class Feature Selection for Text Classification

Abstract

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
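The round-robin idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes each class has already ranked the features by some per-class score (e.g., one-vs-rest Information Gain), and then lets each class contribute its next-best unchosen feature in turn until the feature budget is filled, so that no single class's abundance of strong features crowds out the others.

```python
# Hedged sketch of round-robin multi-class feature selection.
# Assumes `rankings` maps each class to its feature indices, best first
# (e.g., ranked by one-vs-rest Information Gain, computed elsewhere).

def round_robin_select(rankings, budget):
    """Cycle over classes, taking each class's top unchosen feature,
    until `budget` features are selected or all rankings are exhausted."""
    selected = []
    chosen = set()
    positions = {c: 0 for c in rankings}  # next rank to try, per class
    while len(selected) < budget:
        progress = False
        for c, ranking in rankings.items():
            if len(selected) >= budget:
                break
            pos = positions[c]
            # Skip features already contributed by another class.
            while pos < len(ranking) and ranking[pos] in chosen:
                pos += 1
            if pos < len(ranking):
                feat = ranking[pos]
                selected.append(feat)
                chosen.add(feat)
                positions[c] = pos + 1
                progress = True
            else:
                positions[c] = pos
        if not progress:  # every class's ranking is exhausted
            break
    return selected

# Example: class "b" shares its top feature with "a", so it yields
# its second choice; the difficult class "c" is still guaranteed a slot.
picks = round_robin_select({"a": [3, 1, 2], "b": [3, 5, 7], "c": [9, 5]}, 4)
# -> [3, 5, 9, 1]
```

In this toy run, a pooled global ranking might have filled the budget entirely from classes "a" and "b"; the round-robin draw ensures class "c" contributes feature 9 regardless.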

Cite

Text

Forman. "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification." International Conference on Machine Learning, 2004. doi:10.1145/1015330.1015356

Markdown

[Forman. "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification." International Conference on Machine Learning, 2004.](https://mlanthology.org/icml/2004/forman2004icml-pitfall/) doi:10.1145/1015330.1015356

BibTeX

@inproceedings{forman2004icml-pitfall,
  title     = {{A Pitfall and Solution in Multi-Class Feature Selection for Text Classification}},
  author    = {Forman, George},
  booktitle = {International Conference on Machine Learning},
  year      = {2004},
  doi       = {10.1145/1015330.1015356},
  url       = {https://mlanthology.org/icml/2004/forman2004icml-pitfall/}
}