A Pitfall and Solution in Multi-Class Feature Selection for Text Classification
Abstract
Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
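The round-robin idea from the abstract can be illustrated with a minimal sketch (this is an illustrative reconstruction, not the paper's exact algorithm): score every feature per class with one-vs-rest Information Gain, then let the classes take turns claiming their best remaining feature, so no class is starved of discriminative features.

```python
# Hypothetical sketch of round-robin multi-class feature selection.
# Features and one-vs-rest class labels are assumed binary (0/1).
import math
from collections import Counter

def info_gain(feature_col, labels):
    """Information gain of a binary feature about binary labels."""
    def entropy(ys):
        n = len(ys)
        return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())
    base = entropy(labels)
    cond = 0.0
    for v in (0, 1):
        subset = [y for x, y in zip(feature_col, labels) if x == v]
        if subset:
            cond += len(subset) / len(labels) * entropy(subset)
    return base - cond

def round_robin_select(X, y, classes, k):
    """Pick k features by cycling through the classes; each class in
    turn claims its top-scoring (one-vs-rest IG) unclaimed feature."""
    n_features = len(X[0])
    rankings = {}
    for c in classes:
        binary = [1 if label == c else 0 for label in y]  # one-vs-rest
        scores = [(info_gain([row[j] for row in X], binary), j)
                  for j in range(n_features)]
        rankings[c] = [j for _, j in sorted(scores, reverse=True)]
    chosen, turn = [], 0
    while len(chosen) < k:
        c = classes[turn % len(classes)]
        for j in rankings[c]:
            if j not in chosen:
                chosen.append(j)
                break
        turn += 1
    return chosen
```

On a toy dataset where feature 0 separates class 'a', feature 1 separates class 'b', and feature 2 is noise, `round_robin_select` picks features 0 and 1 for `k=2`, whereas a single pooled ranking could over-represent features for one easy class.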
Cite
Text
Forman. "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification." International Conference on Machine Learning, 2004. doi:10.1145/1015330.1015356
Markdown
[Forman. "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification." International Conference on Machine Learning, 2004.](https://mlanthology.org/icml/2004/forman2004icml-pitfall/) doi:10.1145/1015330.1015356
BibTeX
@inproceedings{forman2004icml-pitfall,
  title = {{A Pitfall and Solution in Multi-Class Feature Selection for Text Classification}},
  author = {Forman, George},
  booktitle = {International Conference on Machine Learning},
  year = {2004},
  doi = {10.1145/1015330.1015356},
  url = {https://mlanthology.org/icml/2004/forman2004icml-pitfall/}
}