Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Abstract

Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory, built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation.
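To make the feature-generation idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. The toy ontology, the word-overlap scoring, the fixed-size context window, and every name in it (`ONTOLOGY`, `best_concept`, `generate_features`) are assumptions chosen for illustration. It shows only the general shape of the pipeline described above: analyze local contexts of the document, map each context to an ontology concept, generalize to ancestor concepts, and add the results to the bag of words.

```python
from collections import Counter

# Hypothetical toy ontology: concept -> (parent concept, characteristic words).
# A real ontology would hold far more concepts and learned word weights.
ONTOLOGY = {
    "Top": (None, set()),
    "Business": ("Top", {"market", "company", "profit"}),
    "Business/Banking": ("Business", {"bank", "loan", "deposit", "account"}),
    "Science": ("Top", {"experiment", "theory"}),
    "Science/Geography": ("Science", {"river", "bank", "valley", "shore"}),
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each token."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in stripped if w]


def ancestors(concept):
    """Yield the ancestors of a concept, excluding the uninformative root."""
    parent = ONTOLOGY[concept][0]
    while parent is not None and parent != "Top":
        yield parent
        parent = ONTOLOGY[parent][0]


def best_concept(context):
    """Return the concept sharing the most words with the context, or None."""
    context_words = set(context)
    scores = {c: len(words & context_words) for c, (_, words) in ONTOLOGY.items()}
    concept, score = max(scores.items(), key=lambda kv: kv[1])
    return concept if score > 0 else None


def generate_features(text, window=5):
    """Augment the bag of words with concept features from local contexts."""
    tokens = tokenize(text)
    features = Counter(tokens)  # the plain bag of words
    for start in range(0, len(tokens), window):
        concept = best_concept(tokens[start:start + window])
        if concept is None:
            continue
        features["CONCEPT:" + concept] += 1
        # Generalize along the hierarchy so related documents share features.
        for parent in ancestors(concept):
            features["CONCEPT:" + parent] += 1
    return features


finance = generate_features("The bank approved the loan and opened a deposit account.")
nature = generate_features("We walked along the river bank near the valley.")
```

In this toy setting the word "bank" appears in both documents, but the surrounding context steers `finance` toward the `Business/Banking` concept and `nature` toward `Science/Geography`, which is a crude stand-in for the implicit word sense disambiguation the abstract mentions. The ancestor features (`CONCEPT:Business`, `CONCEPT:Science`) illustrate generalization through the ontology.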

Cite

Text

Gabrilovich and Markovitch. "Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization." Journal of Machine Learning Research, 2007.

Markdown

[Gabrilovich and Markovitch. "Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization." Journal of Machine Learning Research, 2007.](https://mlanthology.org/jmlr/2007/gabrilovich2007jmlr-harnessing/)

BibTeX

@article{gabrilovich2007jmlr-harnessing,
  title     = {{Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization}},
  author    = {Gabrilovich, Evgeniy and Markovitch, Shaul},
  journal   = {Journal of Machine Learning Research},
  year      = {2007},
  pages     = {2297--2345},
  volume    = {8},
  url       = {https://mlanthology.org/jmlr/2007/gabrilovich2007jmlr-harnessing/}
}