Generalized Features: Their Application to Classification
Abstract
Classification learning algorithms in general, and text classification methods in particular, tend to focus on the features of individual training examples rather than on the relationships between examples. However, in many situations a set of items contains more information than the feature values of its individual items. For example, taking into account the articles that cite, or are cited by, the article in question would increase our chances of classifying it correctly. We propose to recognize and put to use generalized features (or set features), which describe a training example but depend on the dataset as a whole, with the goal of achieving better classification accuracy. Although the idea of generalized features is consistent with the objectives of relational learning (ILP), we feel that, instead of using the computationally heavy and conceptually general ILP methods, there may be a benefit in looking for approaches that use specific relations between texts and, in particular, between emails.

Generalized features are a way to capture the information that lies beyond a particular item: the information that ties the dataset together in some sort of structure. Different datasets have different structures, but we can guess what kind of information would be useful for classification, much as we do when choosing relevant features. For example, we can guess that the references are relevant to the topic of an article, but its relative length is not.

There have been some attempts to incorporate additional information about a dataset into the standard classification process based on plain features. One example is using references to classify technical articles and hyperlinks to classify web pages. This research shows that some links can be confusing while others are very helpful. Another example is character recognition, where the recognition process can be based not only on the shape of a character but also on the preceding characters and even the preceding words.
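To make the citation example concrete, here is a minimal sketch of one possible generalized feature: for a given document, the fraction of its linked neighbors that carry each known class label. It is an illustration only and not the authors' method. The function name `neighbor_label_features`, the edge-list representation of citations, and the toy data are all assumptions introduced here.

```python
from collections import Counter


def neighbor_label_features(doc_id, links, labels, classes):
    """Return, per class, the fraction of labeled neighbors of doc_id in that class.

    Hypothetical helper: `links` is a list of (citing, cited) pairs and
    `labels` maps already-labeled document ids to class names.
    """
    # Treat "cites" and "is cited by" alike, as the abstract suggests.
    neighbors = set()
    for src, dst in links:
        if src == doc_id:
            neighbors.add(dst)
        elif dst == doc_id:
            neighbors.add(src)
    neighbors.discard(doc_id)  # ignore self-citations

    # Only neighbors with a known label contribute to the feature.
    counts = Counter(labels[n] for n in neighbors if n in labels)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in classes]


# Toy data (assumed): "a" cites "b" and "d", and is cited by "c".
links = [("a", "b"), ("c", "a"), ("a", "d"), ("d", "e")]
labels = {"b": "ml", "c": "ml", "d": "db"}
classes = ["ml", "db"]

features = neighbor_label_features("a", links, labels, classes)
```

The resulting vector depends on the dataset as a whole rather than on the text of document "a", and it could be appended to an ordinary bag-of-words vector before training a standard classifier.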
Our attention is focused on the email classification problem. Nowadays, when a typical user receives about 40-50 email messages daily, there is a great need for automatic classification systems that can sort, archive, and filter messages accurately. Typically, people treat emails as general texts and base classification decisions on the words that appear in the header and in the body of an
Cite
Text
Kiritchenko and Matwin. "Generalized Features: Their Application to Classification." AAAI Conference on Artificial Intelligence, 2002. doi:10.5555/777092.777258
Markdown
[Kiritchenko and Matwin. "Generalized Features: Their Application to Classification." AAAI Conference on Artificial Intelligence, 2002.](https://mlanthology.org/aaai/2002/kiritchenko2002aaai-generalized/) doi:10.5555/777092.777258
BibTeX
@inproceedings{kiritchenko2002aaai-generalized,
title = {{Generalized Features: Their Application to Classification}},
author = {Kiritchenko, Svetlana and Matwin, Stan},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2002},
pages = {985},
doi = {10.5555/777092.777258},
url = {https://mlanthology.org/aaai/2002/kiritchenko2002aaai-generalized/}
}