Recognizing Structure in Web Pages Using Similarity Queries

Abstract

We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful" (i.e., a structure that was used in a hand-coded "wrapper", or extraction program, for the page) nearly 70% of the time. This improves on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or for maintaining a wrapper as a Web page changes format. In these settings, the top-ranked structure is meaningful nearly 85% of the time.

Introduction

Web-based information integration systems allow a user to query structured information that has been extracted from the Web (Levy, Rajaraman, & Ordille 1996; Garcia...

Cite

Text

Cohen. "Recognizing Structure in Web Pages Using Similarity Queries." AAAI Conference on Artificial Intelligence, 1999.

Markdown

[Cohen. "Recognizing Structure in Web Pages Using Similarity Queries." AAAI Conference on Artificial Intelligence, 1999.](https://mlanthology.org/aaai/1999/cohen1999aaai-recognizing/)

BibTeX

@inproceedings{cohen1999aaai-recognizing,
  title     = {{Recognizing Structure in Web Pages Using Similarity Queries}},
  author    = {Cohen, William W.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {1999},
  pages     = {59--66},
  url       = {https://mlanthology.org/aaai/1999/cohen1999aaai-recognizing/}
}