Active Learning for Hierarchical Wrapper Induction
Abstract
Information mediators that allow users to integrate data from several Web sources rely on wrappers that extract the relevant data from the Web documents. Wrappers turn collections of Web pages into database-like tables by applying a set of extraction rules to each individual document. Even though the extraction rules can be written by humans, this is undesirable because the process is tedious, time consuming, and requires a high level of expertise. As an alternative to manually writing extraction rules, we created STALKER (Muslea, Minton, & Knoblock 1999), which is a wrapper induction algorithm that learns high-accuracy extraction rules. The major novelty introduced by STALKER is the concept of hierarchical wrapper induction: the extraction of the relevant data is performed in a hierarchical manner based on the embedded catalog tree (ECT), which is a user-provided description of the information to be extracted. Consider the sample document
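The idea of applying extraction rules hierarchically, guided by an embedded catalog tree, can be illustrated with a minimal sketch. This is not the STALKER algorithm and does not learn anything: the `Node` class, the `skip_to` and `extract` helpers, the landmark-style rule format, and the sample page are all hypothetical, chosen only to show how hand-written rules attached to a tree could turn a page into table-like rows.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical embedded catalog tree (ECT)."""
    name: str
    start: list            # landmarks to skip past, in order, to reach the item
    end: str               # landmark that terminates the item
    is_list: bool = False  # True if the item repeats inside its parent
    children: list = field(default_factory=list)


def skip_to(text, landmarks, pos=0):
    """Return the index just past all landmarks (consumed in order), or -1."""
    for landmark in landmarks:
        i = text.find(landmark, pos)
        if i < 0:
            return -1
        pos = i + len(landmark)
    return pos


def extract(node, text):
    """Extract a node's item(s) from its parent's text, then recurse."""
    results = []
    pos = 0
    while True:
        start = skip_to(text, node.start, pos)
        if start < 0:
            break
        end = text.find(node.end, start)
        if end < 0:
            break
        chunk = text[start:end]
        if node.children:
            # Children are extracted from the parent's chunk only,
            # not from the whole document.
            results.append({c.name: extract(c, chunk) for c in node.children})
        else:
            results.append(chunk.strip())
        pos = end + len(node.end)
        if not node.is_list:
            break
    if node.is_list:
        return results
    return results[0] if results else None


# Hypothetical sample page and ECT (illustrative data, not from the paper).
doc = (
    "<h1>Restaurants</h1><ul>"
    "<li><b>Yala</b> Cuisine: <i>Thai</i></li>"
    "<li><b>Roma</b> Cuisine: <i>Italian</i></li>"
    "</ul>"
)
ect = Node(
    name="restaurants",
    start=["<li>"],
    end="</li>",
    is_list=True,
    children=[
        Node(name="name", start=["<b>"], end="</b>"),
        Node(name="cuisine", start=["Cuisine:", "<i>"], end="</i>"),
    ],
)

rows = extract(ect, doc)
print(rows)
```

Because each child rule only sees the text already isolated by its parent, the individual rules can stay short, which is the intuition behind extracting data level by level rather than with one rule per field over the whole page.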
Cite
Text
Muslea et al. "Active Learning for Hierarchical Wrapper Induction." AAAI Conference on Artificial Intelligence, 1999. doi:10.1007/bf02235647
Markdown
[Muslea et al. "Active Learning for Hierarchical Wrapper Induction." AAAI Conference on Artificial Intelligence, 1999.](https://mlanthology.org/aaai/1999/muslea1999aaai-active/) doi:10.1007/bf02235647
BibTeX
@inproceedings{muslea1999aaai-active,
title = {{Active Learning for Hierarchical Wrapper Induction}},
author = {Muslea, Ion and Minton, Steven and Knoblock, Craig A.},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {1999},
pages = {975},
doi = {10.1007/bf02235647},
url = {https://mlanthology.org/aaai/1999/muslea1999aaai-active/}
}