A Pitfall and Solution in Multi-Class Feature Selection for Text Classification

Abstract

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
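The round-robin idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes each class has already ranked the features by some per-class score (e.g., one-vs-rest Information Gain), and then lets each class contribute its next-best unchosen feature in turn until the feature budget is filled, so that no single class's abundance of strong features crowds out the others.

```python
# Hedged sketch of round-robin multi-class feature selection.
# Assumes `rankings` maps each class to its feature indices, best first
# (e.g., ranked by one-vs-rest Information Gain, computed elsewhere).

def round_robin_select(rankings, budget):
    """Cycle over classes, taking each class's top unchosen feature,
    until `budget` features are selected or all rankings are exhausted."""
    selected = []
    chosen = set()
    positions = {c: 0 for c in rankings}  # next rank to try, per class
    while len(selected) < budget:
        progress = False
        for c, ranking in rankings.items():
            if len(selected) >= budget:
                break
            pos = positions[c]
            # Skip features already contributed by another class.
            while pos < len(ranking) and ranking[pos] in chosen:
                pos += 1
            if pos < len(ranking):
                feat = ranking[pos]
                selected.append(feat)
                chosen.add(feat)
                positions[c] = pos + 1
                progress = True
            else:
                positions[c] = pos
        if not progress:  # every class's ranking is exhausted
            break
    return selected

# Example: class "b" shares its top feature with "a", so it yields
# its second choice; the difficult class "c" is still guaranteed a slot.
picks = round_robin_select({"a": [3, 1, 2], "b": [3, 5, 7], "c": [9, 5]}, 4)
# -> [3, 5, 9, 1]
```

In this toy run, a pooled global ranking might have filled the budget entirely from classes "a" and "b"; the round-robin draw ensures class "c" contributes feature 9 regardless.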

Cite

Text

Forman. "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification." International Conference on Machine Learning, 2004. doi:10.1145/1015330.1015356

Markdown

[Forman. "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification." International Conference on Machine Learning, 2004.](https://mlanthology.org/icml/2004/forman2004icml-pitfall/) doi:10.1145/1015330.1015356

BibTeX

@inproceedings{forman2004icml-pitfall,
  title     = {{A Pitfall and Solution in Multi-Class Feature Selection for Text Classification}},
  author    = {Forman, George},
  booktitle = {International Conference on Machine Learning},
  year      = {2004},
  doi       = {10.1145/1015330.1015356},
  url       = {https://mlanthology.org/icml/2004/forman2004icml-pitfall/}
}