Wrapper Induction for Information Extraction

Abstract

Many Internet information resources present relational data: telephone directories, product catalogs, etc. Because these sites are formatted for people, mechanically extracting their content is difficult. Systems using such resources typically use hand-coded wrappers, procedures to extract data from information resources. We introduce wrapper induction, a method for automatically constructing wrappers, and identify HLRT, a wrapper class that is efficiently learnable, yet expressive enough to handle 48% of a recently surveyed sample of Internet resources. We use PAC analysis to bound the problem's sample complexity, and show that the system degrades gracefully with imperfect labeling knowledge.

1 Introduction

The Internet contains many sources of relational data. For example, when queried with a name, email address services return ⟨name, email⟩ pairs. But because these sites are designed for people, the content is formatted for human browsing (e.g. an HTML page), rather than for use...
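The abstract names the HLRT wrapper class without showing what such a wrapper does. The following is a minimal sketch, assuming the usual reading of HLRT as head-left-right-tail: a head delimiter marks where the data region starts, a tail delimiter marks where it ends, and each attribute is bracketed by its own left and right delimiters. The function name `exec_hlrt`, the sample page, and the delimiter strings are all hypothetical and chosen for illustration; this shows how a wrapper of this kind could be executed, not how one is induced.

```python
from typing import List, Sequence, Tuple

def exec_hlrt(page: str, head: str, tail: str,
              delimiters: Sequence[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract tuples from `page` with a head-left-right-tail style wrapper.

    `delimiters` holds one (left, right) pair of non-empty strings per attribute.
    """
    pos = page.find(head)
    if pos < 0:
        return []                      # no head delimiter: nothing to extract
    pos += len(head)                   # skip everything up to and including the head

    rows: List[Tuple[str, ...]] = []
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return rows            # malformed page: give up on the partial tuple
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows

# Hypothetical page: a bold title, then (country, code) rows, then a footer.
PAGE = ("<HTML><B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></HTML>")

RESULT = exec_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

In this example the head delimiter `<P>` keeps the bold title from being mistaken for a country, and the tail delimiter `<HR>` does the same for the bold footer. Left and right delimiters alone would extract both.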

Cite

Text

Kushmerick et al. "Wrapper Induction for Information Extraction." International Joint Conference on Artificial Intelligence, 1997.

Markdown

[Kushmerick et al. "Wrapper Induction for Information Extraction." International Joint Conference on Artificial Intelligence, 1997.](https://mlanthology.org/ijcai/1997/kushmerick1997ijcai-wrapper/)

BibTeX

@inproceedings{kushmerick1997ijcai-wrapper,
  title     = {{Wrapper Induction for Information Extraction}},
  author    = {Kushmerick, Nicholas and Weld, Daniel S. and Doorenbos, Robert B.},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {1997},
  pages     = {729-737},
  url       = {https://mlanthology.org/ijcai/1997/kushmerick1997ijcai-wrapper/}
}