Improving a Page Classifier with Anchor Extraction and Link Analysis
Abstract
Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier’s performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential “main-category anchor wrappers”; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.
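The three-stage flow described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the keyword-based scorer, the `(hub, pattern)` grouping, the majority-vote relabeling rule, and all names and thresholds below are assumptions made for illustration. In particular, the majority vote is a simple stand-in for the third learner, which in the paper uses the proposed wrappers as features.

```python
def base_score(text, positive_words):
    """Toy bag-of-words score: fraction of tokens that are class keywords."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in positive_words for t in tokens) / len(tokens)


def propose_wrappers(links):
    """Group link targets by (hub page, anchor pattern).

    Each group plays the role of a candidate anchor wrapper: the set of
    pages reached from one hub through anchors that share one position
    pattern (hypothetical representation).
    """
    wrappers = {}
    for hub, pattern, target in links:
        wrappers.setdefault((hub, pattern), set()).add(target)
    return wrappers


def relabel(pages, links, positive_words, threshold=0.2, min_agreement=0.5):
    """Label pages with the base scorer, then adjust labels using wrappers.

    Assumed heuristic: if more than `min_agreement` of the pages extracted
    by a wrapper are positive under the base scorer, mark all of them
    positive. Wrappers that extract fewer than two pages are ignored.
    """
    base = {url: base_score(text, positive_words) >= threshold
            for url, text in pages.items()}
    final = dict(base)
    for targets in propose_wrappers(links).values():
        known = [t for t in targets if t in base]
        if len(known) < 2:
            continue
        agreement = sum(base[t] for t in known) / len(known)
        if agreement > min_agreement:
            for t in known:
                final[t] = True
    return base, final


# Hypothetical site: three pages linked from one list on a hub page,
# plus one unrelated footer link.
pages = {
    "/bio/a": "chief exec officer biography",
    "/bio/b": "officer biography and history",
    "/bio/c": "joined the firm in 1999",
    "/contact": "phone and address",
}
links = [
    ("/index", "ul/li/a", "/bio/a"),
    ("/index", "ul/li/a", "/bio/b"),
    ("/index", "ul/li/a", "/bio/c"),
    ("/index", "footer/a", "/contact"),
]
base, final = relabel(pages, links, {"exec", "officer", "biography"})
```

In this toy site the base scorer misses `/bio/c`, but the wrapper that groups it with two positively labeled siblings flips its label, while `/contact` stays negative. The result follows a simple hub structure and still agrees with the base scorer on most pages.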
Cite
Text
Cohen. "Improving a Page Classifier with Anchor Extraction and Link Analysis." Neural Information Processing Systems, 2002.
Markdown
[Cohen. "Improving a Page Classifier with Anchor Extraction and Link Analysis." Neural Information Processing Systems, 2002.](https://mlanthology.org/neurips/2002/cohen2002neurips-improving/)
BibTeX
@inproceedings{cohen2002neurips-improving,
title = {{Improving a Page Classifier with Anchor Extraction and Link Analysis}},
author = {Cohen, William W.},
booktitle = {Neural Information Processing Systems},
year = {2002},
pages = {1505-1512},
url = {https://mlanthology.org/neurips/2002/cohen2002neurips-improving/}
}