Feature Generation for Text Categorization Using World Knowledge

Abstract

We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field.
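The pipeline described in the abstract (map document contexts onto ontology concepts, generalize along the hierarchy, then augment the bag of words) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the five-concept `TOY_ONTOLOGY`, the fixed-size context windows, the `CONCEPT:` feature prefix, and the simple term-overlap scoring in `best_concept`. The abstract does not specify how the feature generator scores concepts, so this should not be read as the authors' method.

```python
from collections import Counter

# Hypothetical toy ontology: concept -> (parent concept, descriptive terms).
# A real ontology such as the Open Directory would be far larger.
TOY_ONTOLOGY = {
    "Top": (None, set()),
    "Business": ("Top", {"market", "company", "shares"}),
    "Business/Banking": ("Business", {"bank", "loan", "interest", "deposit"}),
    "Science": ("Top", {"research", "study"}),
    "Science/Geography": ("Science", {"river", "bank", "water", "shore"}),
}


def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in stripped if w]


def best_concept(context, ontology):
    """Return the concept whose terms overlap most with the context.

    Term overlap is an assumed stand-in for the paper's contextual
    analysis. Returns None when no concept shares a term.
    """
    words = set(context)
    scores = {c: len(words & terms) for c, (_, terms) in ontology.items()}
    concept, score = max(scores.items(), key=lambda kv: kv[1])
    return concept if score > 0 else None


def ancestors(concept, ontology):
    """Return the concept followed by its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = ontology[concept][0]
    return chain


def generate_features(text, ontology=TOY_ONTOLOGY, window=5):
    """Bag of words augmented with concept features per context window."""
    tokens = tokenize(text)
    features = Counter(tokens)  # the standard bag of words
    for start in range(0, len(tokens), window):
        concept = best_concept(tokens[start:start + window], ontology)
        if concept is not None:
            # Generalization: also emit every ancestor of the matched concept.
            for c in ancestors(concept, ontology):
                features["CONCEPT:" + c] += 1
    return features


finance = generate_features("The bank raised the interest rate on every loan.")
nature = generate_features("We walked along the river bank near the water.")
print(sorted(k for k in finance if k.startswith("CONCEPT:")))
print(sorted(k for k in nature if k.startswith("CONCEPT:")))
```

In this toy setup the ambiguous word "bank" maps to different concepts depending on the surrounding words, which is the sense in which contextual analysis implicitly disambiguates. Emitting ancestor concepts lets documents that use different vocabulary still share a generated feature.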

Cite

Text

Gabrilovich and Markovitch. "Feature Generation for Text Categorization Using World Knowledge." International Joint Conference on Artificial Intelligence, 2005.

Markdown

[Gabrilovich and Markovitch. "Feature Generation for Text Categorization Using World Knowledge." International Joint Conference on Artificial Intelligence, 2005.](https://mlanthology.org/ijcai/2005/gabrilovich2005ijcai-feature/)

BibTeX

@inproceedings{gabrilovich2005ijcai-feature,
  title     = {{Feature Generation for Text Categorization Using World Knowledge}},
  author    = {Gabrilovich, Evgeniy and Markovitch, Shaul},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2005},
  pages     = {1048--1053},
  url       = {https://mlanthology.org/ijcai/2005/gabrilovich2005ijcai-feature/}
}