An Evaluation on Feature Selection for Text Clustering

Abstract

Feature selection methods have been successfully applied to text categorization but seldom to text clustering, because class label information is unavailable. In this paper, we first give empirical evidence that feature selection methods can improve the efficiency and performance of text clustering algorithms. We then propose a new feature selection method called "Term Contribution (TC)" and perform a comparative study of a variety of feature selection methods for text clustering, including Document Frequency (DF), Term Strength (TS), Entropy-based (En), Information Gain (IG) and the χ² statistic (CHI). Finally, we propose an "Iterative Feature Selection (IF)" method that addresses the problem of unavailable labels by using an effective supervised feature selection method to iteratively select features and perform clustering. Detailed experimental results on Web Directory data are provided in the paper.

Proceedings of the Twentieth International Conference on Machine Learning (ICML)
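The iterative scheme described in the abstract (cluster, treat the cluster assignments as class labels, run a supervised feature selector, repeat) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the binary term vectors, the plain k-means with deterministic initialization, the choice of the χ² statistic as the supervised criterion, the fixed number of rounds, and all function names are assumptions made for the example.

```python
def vectorize(doc_sets, vocab):
    """Binary bag-of-words vectors restricted to the current vocabulary."""
    return [[1.0 if t in d else 0.0 for t in vocab] for d in doc_sets]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iters=20):
    """Plain k-means. Assumption: the first k vectors seed the centroids,
    which keeps the example deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def chi_square(doc_sets, labels, term, cluster):
    """Chi-square statistic of a term against one cluster (2x2 table)."""
    n = len(doc_sets)
    a = sum(1 for d, l in zip(doc_sets, labels) if term in d and l == cluster)
    b = sum(1 for d, l in zip(doc_sets, labels) if term in d and l != cluster)
    c = sum(1 for d, l in zip(doc_sets, labels) if term not in d and l == cluster)
    e = sum(1 for d, l in zip(doc_sets, labels) if term not in d and l != cluster)
    denom = (a + c) * (b + e) * (a + b) * (c + e)
    return n * (a * e - c * b) ** 2 / denom if denom else 0.0


def iterative_feature_selection(docs, k, n_features, rounds=3):
    """Alternate clustering and supervised selection on the cluster labels."""
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    for _ in range(rounds):
        # Step 1: cluster with the current feature set.
        labels = kmeans(vectorize(doc_sets, vocab), k)
        # Step 2: score each term using the cluster labels as pseudo-classes.
        scores = {
            t: max(chi_square(doc_sets, labels, t, c) for c in range(k))
            for t in vocab
        }
        # Step 3: keep only the highest-scoring terms for the next round.
        vocab = sorted(vocab, key=lambda t: -scores[t])[:n_features]
    # Final clustering on the selected features.
    return kmeans(vectorize(doc_sets, vocab), k), vocab


# Toy corpus (hypothetical): two topics plus an uninformative word "the".
DOCS = [
    ["python", "code", "bug", "the"],
    ["soccer", "goal", "team", "the"],
    ["python", "code", "compiler", "the"],
    ["soccer", "team", "league", "the"],
    ["code", "bug", "compiler", "the"],
    ["goal", "team", "league", "the"],
]

labels, vocab = iterative_feature_selection(DOCS, k=2, n_features=4)
print(labels, vocab)
```

In this toy run the word "the" occurs in every document, so its χ² score against any cluster is zero and it is dropped in the first round. A real setup would use weighted term vectors and a stopping rule rather than a fixed number of rounds.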

Cite

Text

Liu et al. "An Evaluation on Feature Selection for Text Clustering." International Conference on Machine Learning, 2003.

Markdown

[Liu et al. "An Evaluation on Feature Selection for Text Clustering." International Conference on Machine Learning, 2003.](https://mlanthology.org/icml/2003/liu2003icml-evaluation/)

BibTeX

@inproceedings{liu2003icml-evaluation,
  title     = {{An Evaluation on Feature Selection for Text Clustering}},
  author    = {Liu, Tao and Liu, Shengping and Chen, Zheng and Ma, Wei-Ying},
  booktitle = {International Conference on Machine Learning},
  year      = {2003},
  pages     = {488-495},
  url       = {https://mlanthology.org/icml/2003/liu2003icml-evaluation/}
}